Message from the Program Chair


Welcome to the 18th LISA Conference. Thank you for attending. 


The Authors, Program Committee, Track Chairs, Readers, and Reviewers have done 
tremendous work to put together a very strong conference. I hope you are well rested; it 
will be a busy and thrilling week for everyone. 


The theme of this year’s conference, “System Administration Reality - Automation,
Configuration, and Users,” is meant to focus our attention on the day-to-day issues of
our profession while at the same time providing an opportunity to look towards the future 
of System Administration and our very active role in its development. 


Perpetual problems such as backups and printing don’t appear to be getting any closer to 
A Solution than they were when this conference started 18 years ago. However, many 
other problems have been resolved or have reasonable solutions proposed for them. 
These pages contain some of those proposed solutions as well as solutions for other 
problems you may not have encountered yet. 


It is the combination of research and day-to-day System Administration that is the 
strength of this conference and its community. We all contribute to the development of 
System Administration as a profession. We all contribute to the tools and tricks of this 
trade. We all have much to be proud of in the work represented in these Proceedings as 
well as that of the previous 17. 


If you hanker for System Administration as a discipline (or profession) with specialized 
information, you need look no further than this tome to see the sort of specialization that 
is emerging. 


These Proceedings from the 2004 LISA Conference contain the complete texts of the 
refereed papers. Of the 70 papers submitted, 22 are presented here. It is always 
challenging for a Program Committee to select papers. Many good papers were not 
accepted due to lack of room or for other reasons. Every such decision is made with 
regret and a hope for resubmission to future conferences. 


I would like to thank everyone who submitted an abstract for consideration. I would 
especially like to thank the authors who have put so much effort into writing these papers. 
Further, I would like to thank all of the people who worked so hard to make this 
conference happen. 


All of us hope that you find these works useful and enjoyable and that you consider 
presenting your own work at future LISA conferences. LISA 2005 is December 4-9 in
San Diego, California. It is never too early to start writing your paper abstracts. 


Lee Damon 
Program Chair 
LISA 2004 
Scalable Centralized Bayesian Spam
Mitigation with Bogofilter 


Jeremy Blosser and David Josephsen — VHA, Inc. 


ABSTRACT 


Bayesian content filters gained popular acclaim when they were put forward in 2002 by Paul 
Graham as a potential long-term solution for the spam problem. They have since fallen from the 
limelight, however, due to perceived attack vulnerabilities inherent to all content-based filters as 
well as real and imagined vulnerabilities specific to Bayesian filters. It has also been assumed that 
Bayesian filters would be problematic to implement in centralized or large environments due to 
wordlist management issues. This paper revisits the effectiveness of Bayesian filters as a 
sustainable singular spam solution for mid- to large-sized environments through a real-world study 
of the deployment and operation of the Bogofilter Robinson-Fisher Bayesian classification utility 
in a production mail environment servicing thousands of accounts. Our implementation strategy 
and methodology as well as our results are described in detail so that they can be evaluated and 
replicated if desired. Other filtering methodologies which were previously implemented in this 
environment are also discussed for comparison purposes, though they have since been removed 
from production due primarily to lack of need. Bayesian classification has been able to solve the 
spam problem for this user population for the present and observable future, with a single wordlist, 
and with no secondary spam filtering techniques employed. Significantly, only two
business-related legitimate messages have been reported as blocked due to filter
misclassification since Bogofilter was deployed.


Introduction 


Unsolicited bulk and commercial email, popu- 
larly known as spam, is one of the most critical issues 
facing systems, mail, and network administrators 
today. More and more human and system resources 
are being dedicated both to dealing with the day-to- 
day filtering of mail and to determining any possible 
long-term solutions to the problem, be they technical, 
social, or legislative. Whatever solutions are applied, 
however, are inevitably subverted or defeated by 
spammers within a short time of implementation, 
causing many to characterize the situation as an arms 
race. Fixed string content filters are avoided by man- 
gling commonly blocked words. MTA blacklists cause 
spammers simply to change their mail routes and ser- 
vice providers and cause excessive collateral damage 
[JAC]. Challenge-response systems have led to spam- 
mers adding mail route harvesting to their existing 
address harvesting practices and cause similar collateral 
damage [SEL]. Message repositories and checksum- 
ming databases are brute force solutions with high false 
negative rates [MER]; in our experience they also 
require persistent high maintenance and ever-increasing 
resource utilization. The most recent attempts to add 
authentication into the mail delivery process through 
extending or replacing SMTP seem likely to be the 
most costly yet in terms of collateral damage and infra- 
structure costs [KNOJ], yet spammers are already 
observably capable of bypassing these measures using 
hijacked end-user machines to send messages using the 
local mail submission system and routing through 
authenticated channels, in the same way that email
worms currently propagate [JdeBP]. This is not to say 
that these technical methods are entirely without merit 
or usefulness, as they are all effective in at least block- 
ing some percentage of spam. However, the cost of 
these incremental cures is escalating to the point they 
may become as bad as the disease itself. 


Legislative methods are in their infancy but are 
already being undermined by political pressures and 
seem likely primarily to force spammers to move even 
more of their operations to countries with friendlier 
laws, something they have demonstrated extreme will- 
ingness to do. Diligence on the part of prosecutors 
may yet produce results here, but they are likely to be 
a long time coming. The problem is here today, and its 
current severity cannot be overstated. End users who
are the worst affected are reporting hours spent per 
week deleting and attempting to block unwanted mail 
that is increasingly offensive in nature, and more and 
more are determining the effort of keeping their email 
usable is not worth it and are instead returning to other 
forms of communication. 


When Paul Graham published his 2002 paper “A 
Plan for Spam” [GRA], which proposed applying 
Bayesian statistical modeling as a method of content 
filtering and provided very promising early results from 
Graham’s own tests, many in the anti-spam and end- 
user communities lauded the approach as a possible 
permanent solution. Bayesian filtering claimed all the 
advantages of content filtering, while adding learning 
algorithms to defeat message character changes over 
time and token-mangling attacks. It also promised an
extremely minimal percentage of legitimate mail incor- 
rectly classified as spam (false positives), and added a 
level of personalization in block lists that was unprece- 
dented. Multiple Bayesian filter implementations 
appeared virtually overnight from both hobbyist and 
commercial sources, ranging from open source projects 
such as Bayespam and Bogofilter to implementations in 
Mozilla Thunderbird, Microsoft Outlook, and Apple’s 
Mail program. Bayesian filtering fell from the limelight 
nearly as quickly, however, due to predicted difficulties 
in large-scale or centralized implementations, some 
success by spammers in defeating early Bayesian 
implementations using poisoning attacks, assumptions 
of vulnerabilities common among other content filters, 
and other issues both real and imagined. Academics 
and pundits continue to discuss the value and potentials 
of Bayesian methods, and Bayesian classifiers are still 
deployed alongside other filtering systems, but a popu- 
lar opinion among administrators and the public seems 
to be that Bayesian filters are no better at providing a 
long-term solution to the problem than other methods 
[BAA, ALL, EMM, PAG, WAR, JdeBP2, BOW]. 


These dismissals are demonstrably premature, 
however. We implemented Bogofilter in a centralized 
capacity using common wordlists for an environment 
with thousands of accounts in April of 2003. Despite 
the best efforts of spammers to defeat it, Bogofilter has 
continually exceeded all expectations as a filter and has 
effectively solved the spam problem in our environ- 
ment, allowing us to remove all other spam-specific fil- 
ters from our architecture. Though we initially only set 
out to bring the problem under control and not necessar- 
ily to eliminate all spam from our environment, our esti- 
mates indicate we are currently able to accurately block 
98-99% of the spam sent to us. The filter blocks an 
average of 1,100 incoming spam messages per hour 
(700,000 to 1 million spam messages per month), or 
60-75% of our incoming mail volume. This amounts to 
roughly 40 spams per user per work day. We have 
vastly exceeded management’s goals as well as our 
own, and our users are able to conduct business without 
constant solicitations appearing in their inboxes. Per- 
haps most importantly, only two business-related legiti- 
mate messages have been reported as blocked by this 
filter since it was implemented more than a year ago. 
While no solution is likely to last forever, our results 
indicate it is possible that properly configured and 
deployed Bayesian filters are capable of sustaining a 
high enough success rate that administrators may finally 
be able to stop spending all their time refining filters 
and instead focus on securing end-user machines, thus 
finally moving to take the offensive in the spam war. 


Environment and Goals 


Company and Environment 


Our company, VHA Inc., is an alliance of not-for-profit
hospitals, health systems and their affiliates.
Member organizations range from single 50-bed facilities
to large, integrated health care systems made up of
multiple hospitals, physician clinics, and support care
sites. Notable member organizations include Baylor
Health Care System, Cedars-Sinai Health System, Mayo
Foundation, and Yale-New Haven Health Services Corp.


The mail environment in question provides mail 
services for more than 2,000 accounts distributed 
between our corporate headquarters and approxi- 
mately 30 regional offices nationwide. Monthly mail 
volume over the past year averaged 1.85 million mes- 
sages per month, 70% incoming and 30% outgoing. 
Based on filter logs and user feedback, we estimate 
today that on average 65-70% of the incoming mail 
(nearly 50% of the total mail volume) is spam. The 
content of our mail is typical of any company our size; 
the one exception is that since we operate in the health 
care industry and specifically the medical supply pur- 
chasing industry, we see quite a lot of legitimate mail 
having to do with pharmaceutical contracts and sup- 
plies, i.e., messages mentioning Viagra and other 
drugs can and do show up in our legitimate mail flow. 


Architecture 


All mail entering or leaving the environment 
passes through a centralized pair of load-balanced 
mail exchangers running the qmail MTA. The first 
exchanger currently has dual 1 GHz Pentium IIIs, the
second has dual 700 MHz PIIIs. Both have 1 GB
RAM. Incoming mail is handed off to a Microsoft
Exchange system which also handles all internal mail 
(Figure 1). Outgoing mail originating at the Exchange 
system is handed to the mail exchangers directly, 
while mail originating from internal applications uses 
a separate internal qmail server for relaying to the out- 
side. All of our spam filtering efforts have been tar- 
geted at the qmail environment due to resource utiliza- 
tion issues and end-user interaction concerns. 





Figure 1: Mail environment architecture before 
Bogofilter. 




The current Bogofilter implementation added 
one additional server to our environment. This server 
is responsible for caching mail as required for filtering 
purposes, receiving end-user filtering corrections, pro- 
viding the environment that administrators use for pro- 
cessing corrections and ongoing training of Bogofilter, 
and holding the master copy of the wordlists (Figure 
2). This server currently has dual 2 GHz Xeons and 2 
GB RAM, but rarely has a CPU load average higher
than 0.1. More information about this server is provided
in the “Design” section below.


Figure 2: Mail environment architecture after
Bogofilter. (Diagram labels recovered from the figure: DMZ,
Exchange 2000, Spam Processing Server.)


Goals 


Our goal was not to eliminate spam completely 
from our environment, but to bring the problem under 
control so that normal work could go on unaffected. 
The initial target was to block 75% of our spam, which 
we initially estimated would be 30% of our incoming 
mail volume. We also needed to guarantee that we 
would not block any legitimate business mail in the 
process of blocking spam. Given our mail volume, any- 
thing we implemented needed to be fast and efficient 
enough not to make excessive demands on our existing 
infrastructure. As a final goal, we wanted to keep the 
blocking centrally managed at the server level, rather 
than something that the users would have to deal with. 


Initial Filtering Attempts 


Basic Filters 


Basic initial attempts at blocking via fixed-string 
matches against the sender address, reverse DNS 
lookups, and even a short-lived block of mail from all 
non-US domains met with predictable results. We 
quickly learned that the spammers we were dealing 
with were not just sending indiscriminately to global 
email lists, but were specifically monitoring the deliv- 
ery status of spam entering our environment and were 
willing and able to change their messages to avoid our 
filters. We were not willing to spend hours each day 
creating new filters which would be made obsolete 
within a week, so we began to look in earnest for a 
more viable long-term solution. 


Due to the nature of the spam arms race and our 
goal of efficiency we decided to target our efforts at 
automated content-based filtering. Blacklists were not 
considered an option due to the requirement to avoid 
blocking any form of legitimate mail. Challenge- 
response systems placed unacceptable burdens on our 
business contacts while simultaneously being consid- 
ered too easy to spoof. More dramatic efforts aimed at 
user and mail route authentication seemed unlikely to 
provide any long-term relief, since we assumed they
would primarily force spammers to continue their 
recent attacks on user computers themselves, using 
zombies and the local user’s own credentials to send 
spam through valid mail routes. Spam is not spam 
without the content, however, and content-based filter- 
ing appeared to offer the most desirable combination 
of accuracy and tunability. Scalability was the most 
likely point of failure, but we hoped that a sufficiently 
automated system could handle the load. 


Vipul’s Razor 


Vipul’s Razor, commercially distributed through 
Cloudmark, was selected for a pilot. This is a check- 
summing content filter; each incoming mail is com- 
pared over the network in real time to a database of 
known spam messages. This database is fed by user 
submissions and spamtrap addresses maintained by 
Cloudmark. The advantages of this type of system are 
that it has a negligible false positive rate and is rela- 
tively difficult to attack directly. The primary working 
attack is to find a way to pollute the spam database or 
otherwise disrupt the checksum sharing process. The 
disadvantages include scalability and ongoing mainte- 
nance; someone, somewhere, must receive each new 
spam before it can be stored for future comparison. 


While the implementation was effective in block- 
ing 30% of our incoming mail volume as spam, it 
introduced significant system overhead and proved 
ultimately unscalable. This was primarily due to the 
extra network traffic required to contact the check- 
summing database and, even more significantly, the 
Perl instantiation required per message on the mail 
exchangers for the Razor client. Our environment 
experienced more service delays and outages than it 
ever had previously, due to both intermittent network 
errors in reaching the database servers and excessive 
CPU load produced by the clients. 


Most important, however, was the fact that our 
end users reported no noticeable change in the volume 
of spam they were dealing with, despite the 30% 
reduction in overall mail volume. Complaints of offen- 
sive spam during this period actually increased. This 
was our first indication that our initial estimate of the 
size of our problem was off by a wide margin. We 
considered moving to a local checksum-based solution 
such as DCC, but eventually concluded that this 
method was too high maintenance to be viable in our 
environment. We began researching other alternatives, 
with even more attention paid to the real world system 
requirements imposed by prospective solutions. 


Bogofilter Implementation 


Bayesian Classification and Filtering 


Although there was work being done on 
Bayesian classification of spam as early as 1996, Paul 
Graham’s paper [GRA] is generally credited as being 
the first description of a solid implementation with 
promising results. Gary Robinson later improved on
Graham’s work, suggesting algorithm modifications to 
make Graham’s technique mathematically consistent 
with Bayesian statistics [ROB]. Robinson’s probabil- 
ity-combination improvements make use of the 
inverse chi-square function first described by Ronald 
Fisher and are therefore sometimes referred to as the 
Robinson-Fisher algorithm. 


Probability theory holds that the probability of a 
given event can be plausibly estimated by observing 
how often it has occurred in the past under similar cir- 
cumstances. Bayesian probability begins by creating a 
“prior probability distribution,” or “prior,” which esti- 
mates the probability of given events. The prior can be 
used to calculate the likely outcome of new events. 
Then experiments are run and the outcomes recorded. 
The prior can then be updated to reflect the outcome of 
the experiments. Bayesian algorithms are self-correct- 
ing in this respect: they learn from their mistakes. 


Token Scores 


If an email is viewed as a collection of discrete 
words, or “tokens,” a score can be assigned to each 
token. These scores are roughly comparable to proba- 
bilities in that they directly correspond to a token’s 
“spamminess” or “non-spamminess.” Each token’s
score represents how likely that token is to appear in 
an email composed of tokens that are uniformly dis- 
tributed and statistically independent. The more 
“spammy” or “not spammy” a token is, the less 
likely it would be to appear in such an archetypal 
email. That is, the presence of these “interesting” 
tokens in an email violates statistically neutral linguis- 
tic behavior in measurable ways. 


These scores are easy to calculate given a rela- 
tively even number of manually sorted spam and non- 
spam emails; the math is described in detail by Robin- 
son in his paper “Spam Detection” [ROB]. The tokens 
themselves, plus the frequency with which they 
occurred in the text of spam and non-spam messages 
previously used to train the filter, are used to create a 
prior probability distribution in database form. Most 
Bayesian spam implementations are “objective” in that 
they put a lot of thought into assigning the prior, espe- 
cially for tokens that do not currently exist as part of the 
prior. Bogofilter is no exception. When it encounters a 
new token, it assigns it the value of the variable robx 
(“Robinson’s x”). robx is calculated as the average of
the scores of all the other known tokens. The variable
robs (“Robinson’s s”) is used in situations where
Bogofilter has only seen the token a few times (low 
data situations). robs acts as a user-defined metric of 
trust and limits robx’s effect in low data situations. 
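
The weighting between robx and robs can be illustrated with a short sketch. It follows the formula in Robinson's paper [ROB], f(w) = (s*x + n*p(w)) / (s + n), where n is the number of training messages containing the token. The function name, argument names, and the default values of robx and robs below are assumptions chosen for illustration; they are not Bogofilter's internals or its shipped defaults.

```python
def token_score(spam_count, ham_count, n_spam_msgs, n_ham_msgs,
                robx=0.5, robs=1.0):
    """Robinson-style token score f(w) = (s*x + n*p(w)) / (s + n).

    spam_count / ham_count: training messages of each class containing the token.
    n_spam_msgs / n_ham_msgs: total training messages of each class (must be > 0).
    robx, robs: illustrative values only, not Bogofilter's defaults.
    """
    n = spam_count + ham_count
    if n == 0:
        # Never-seen token: fall back to the assumed prior robx.
        return robx
    # Normalize by corpus size so unequal corpora do not bias the score.
    spam_freq = spam_count / n_spam_msgs
    ham_freq = ham_count / n_ham_msgs
    p = spam_freq / (spam_freq + ham_freq)
    # With few sightings (small n) the score stays near robx;
    # with many sightings it approaches the observed ratio p.
    return (robs * robx + n * p) / (robs + n)
```

In this sketch a token seen once, only in spam, scores 0.75 rather than 1.0, while a token seen in 100 spam messages and no legitimate ones scores above 0.99. A larger robs keeps rarely seen tokens closer to robx.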


Combined Scores 


Once the prior is used to calculate a score for 
each word in an email, these token scores must be 
combined into a single score, which is representative 
of whether or not the given email is spam. Bogofilter 
uses the Robinson-Fisher algorithm for probability
combination by default. There are three important 
aspects to the inverse chi-square driven algorithm 
Robinson has provided. First, it removes assumptions 
that were not relevant in the context of spam filtering, 
such as assuming the probability of a given token’s 
prediction being correct is the same whether its out- 
come is spam or non-spam. Second, it is more sensi- 
tive to token scores that indicate if a message has an 
underlying tendency toward spam or not spam. Third, 
it uses a user-defined variable to decide how many 
“interesting tokens” to combine, which more consistently
handles both large and small emails. For
Bogofilter this variable is named min_dev.
min_dev is the minimum deviation from the neutral
score of 0.5 a token must have to be considered
“interesting” and therefore be included in the classification
of the message as a whole.
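
A minimal sketch of this combination step follows, assuming the formulation in Robinson's paper [ROB]: H and S are inverse chi-square values computed over the product of the token scores and of their complements, and the final indicator is (1 + H - S) / 2. The function names and the min_dev value are illustrative assumptions, and this is not Bogofilter's actual code.

```python
import math


def chi2q(x, df):
    """Upper-tail chi-square probability for an even number of degrees of freedom."""
    m = x / 2.0
    term = math.exp(-m)  # underflows to 0.0 for very large x (a limitation of this sketch)
    total = term
    for i in range(1, df // 2):
        term *= m / i
        total += term
    return min(total, 1.0)


def combine(scores, min_dev=0.1):
    """Combine token scores into one indicator in [0, 1]; near 1 means spam.

    min_dev here is an illustrative value, not a recommended setting.
    """
    # Keep only "interesting" tokens, i.e. those far enough from neutral 0.5.
    kept = [f for f in scores if abs(f - 0.5) >= min_dev]
    if not kept:
        return 0.5
    # Clamp so the logarithms below are always defined.
    kept = [min(max(f, 1e-9), 1.0 - 1e-9) for f in kept]
    n = len(kept)
    # Summing logs avoids underflow from multiplying many small scores.
    h = chi2q(-2.0 * sum(math.log(f) for f in kept), 2 * n)
    s = chi2q(-2.0 * sum(math.log(1.0 - f) for f in kept), 2 * n)
    return (1.0 + h - s) / 2.0
```

Ten tokens scoring 0.99 combine to a value very close to 1, ten tokens scoring 0.01 to a value very close to 0, and a message with evenly conflicting evidence lands near 0.5.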


The other user-defined variables exist in the form 
of cutoff values to which the combined score of the 
email is compared. In binary classification, emails 
with scores over a singular threshold are considered to 
be spam. Those under it are not. Bogofilter optionally 
allows for three-factor classification. Two cutoffs are 
provided, spam_cutoff (for spam) and ham_cutoff
(for non-spam). Anything in between is
labeled “unsure.” In practice, “unsure” mails provide
interesting fodder for training, so this is the classifica- 
tion method we use. 
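
The three-way decision can be sketched as a simple comparison against the two cutoffs. The cutoff values used here (0.95 and 0.10) are illustrative; how a score that falls exactly on a cutoff is treated is an assumption of this sketch, not something taken from Bogofilter.

```python
def classify(score, spam_cutoff=0.95, ham_cutoff=0.10):
    """Map a combined score to 'spam', 'ham', or 'unsure'.

    Boundary handling (>= and <=) is an assumption made for illustration.
    """
    if score >= spam_cutoff:
        return "spam"
    if score <= ham_cutoff:
        return "ham"
    return "unsure"
```

Lowering ham_cutoff widens the "unsure" band on the legitimate side: a score of 0.07 is "ham" with the values above but "unsure" when ham_cutoff is 0.05.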


Selection of Bogofilter 


Graham’s paper had been published for some 
time at this point, and like the rest of the industry, we 
were intrigued by his reported success. The concept of 
a Bayesian approach appealed to us, given our experi- 
ence with shifting spam patterns and our desire to 
stick with content-based filters. The nature of the fil- 
tering, however, predicted the best results when each 
user maintained individual wordlists to most accu- 
rately reflect the unique nature of individual mail 
spools. Our goals of scalability and minimal user 
intervention were at odds with this, but we decided to 
do some initial testing to see if it could work with a 
single set of wordlists for an entire organization. 


We began monitoring the field of filters claiming 
a Bayesian implementation, and quickly settled on 
Bogofilter. Although other implementations had fea- 
tures Bogofilter at the time lacked, it was an obvious 
choice to us for its small overhead and its attempt to 
work within the Unix philosophy of “do one thing and
do it well.” Bogofilter is written in C and expects to 
operate on standard I/O streams, adding a custom 
header to messages it operates on and/or indicating 
message status with an exit code. This fit our environment
perfectly. (Footnote 1: Ironically, a primary impetus for the
original creation of Bogofilter was to create a filter which
implemented Graham’s proposals but could be quickly deployed by
individuals [ESR].) There are other Bayesian filtering
systems which tend to worry less about efficiency than
dealing with cutting edge theory; these are ideal for 
experimental approaches and furthering the state of 
the art but are less useful in a high-volume production 
environment. This is not to say that Bogofilter’s 
implementation is not correct, however; the Bogofilter 
development group expends a good deal of effort 
ensuring that their implementation of Bayesian algo- 
rithms is correct, as well as tracking changes and 
advances in Bayesian filtering theory and providing 
their own measurements and contributions to the field. 
Bogofilter strives to provide the best of both efficiency 
and accuracy and was therefore a good choice for our 
environment. Selection of this tool has no doubt con- 
tributed heavily to our filtering success. 
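
As an illustration of the stream-and-exit-code interface described above, the following sketch pipes one raw message to the filter and reads the verdict from the exit status. The exit code mapping (0 for spam, 1 for non-spam, 2 for unsure, anything else an error) reflects our reading of the bogofilter manual page and should be checked against the installed version. The helper names and the assumption that the binary is on the PATH as `bogofilter` are ours.

```python
import subprocess

# Assumed mapping, per our reading of the bogofilter manual page.
EXIT_LABELS = {0: "spam", 1: "ham", 2: "unsure"}


def label_for_exit(code):
    """Translate a filter exit status into a label; unknown codes are errors."""
    return EXIT_LABELS.get(code, "error")


def run_filter(raw_message, binary="bogofilter"):
    """Feed one raw message (bytes) to the filter on stdin and return its verdict.

    `binary` is a hypothetical default; adjust the path for the local install.
    """
    result = subprocess.run([binary], input=raw_message,
                            stdout=subprocess.DEVNULL)
    return label_for_exit(result.returncode)
```

Because the verdict travels in the exit status, the same call can be made from a delivery script or an MTA hook without parsing any output.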


Training


Another likely source of our success is our train- 
ing process, both for initial wordlist seeding and subse- 
quent training and corrections. Bayesian filtering is 
unfortunately not a turnkey-style solution; while it is 
possible to implement a Bayesian spam filter (including 
Bogofilter) by simply following the steps in a HOWTO
and running some scripts, best results require that the 
administrators have some understanding of the theory 
and how best to apply it to their environment. This is 
primarily relevant in the initial and ongoing training 
process. We spent a fair amount of time gaining an 
understanding of the theory and the work being done to 
apply it to spam filtering and Bogofilter’s specific 
implementation before moving to implement, and we 
designed our training process accordingly. 


Sorting 


Proper training requires a large pre-sorted collec- 
tion of spam and non-spam messages. It is nearly 
impossible to create an effective Bayesian classifier 
using only a handful of mails. The filter needs to be 
trained on at least several thousand messages each of 
spam and non-spam from the start, preferably in 
nearly equal amounts [LOU]. 


To seed the initial wordlists we therefore col- 
lected several days’ worth of incoming and outgoing 
mail at the mail exchangers. This provided us with 
more than 220,000 messages, which we then classified 
manually into groups of spam and non-spam. Approx- 
imately half of these were discarded as mailer daemon 
traffic such as bounces, delivery overhead, and virus 
quarantine messages. All outgoing mail was automati- 
cally classified as non-spam but was kept to provide a 
large body of known good mail so that if the filter 
established any biasing error it would be in favor of 
keeping legitimate mail. The remaining mail was clas- 
sified in successive phases. Mails were manually 
sorted by one administrator while another looked for 
patterns in the already classified mail to allow for 
batch classifications to speed up the process. These 
batch filters were similar to the ones employed by 
other anti-spam software; while not long-term options 
for our environment, they worked in the short term 
due to the static nature of the mail snapshot we were
operating against and the lower efficiency require- 
ments. Anything faster than a human was a benefit 
here. Further, this processing was done in a develop- 
ment environment away from the regular mail exchang- 
ers, so regular mail load was not affected. We also took 
some time to develop interactive shell scripts to aid the 
manual classification process, both to speed it up and to 
preserve as much privacy as possible. The scripts pro- 
vided only the headers of messages, color-coded to 
highlight suspicious patterns. The full messages were 
available at operator request, but were rarely needed in 
the initial classification stages. 


After approximately 30,000 messages had been 
sorted, we began to incorporate Bogofilter itself as a 
sorting tool. We tested its effectiveness by running it 
against 20,000 pre-sorted messages, 10,000 each ran- 
domly taken from the sorted collections of spam and 
non-spam. Bogofilter without any established
wordlists was presented with random individual mes- 
sages from these collections and asked to determine if 
they were spam or non-spam. If its classification 
agreed with the human classification, no action was 
taken. If its classification disagreed with the human 
classification or Bogofilter was unsure of how to clas- 
sify the message, it was corrected based on the human 
classification (this process is known as “train on 
error”). The output of the classification versus the 
human classification was presented in real time during 
the test run to an administrator (Figure 3). At the end 
of the run, messages Bogofilter had consistently got- 
ten wrong were further investigated. 
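
The train-on-error procedure described above can be sketched as a loop over a pre-sorted corpus. `ToyFilter` is a hypothetical stand-in that merely counts tokens per class; it is not Bogofilter's algorithm and exists only so the loop is runnable. The corpus format, a list of (tokens, is_spam) pairs, is likewise an assumption.

```python
import random
from collections import Counter


class ToyFilter:
    """Hypothetical stand-in classifier; NOT Bogofilter's algorithm."""

    def __init__(self):
        self.spam = Counter()
        self.ham = Counter()

    def train(self, tokens, is_spam):
        # Count each distinct token once per message.
        (self.spam if is_spam else self.ham).update(set(tokens))

    def classify(self, tokens):
        s = sum(self.spam[t] for t in set(tokens))
        h = sum(self.ham[t] for t in set(tokens))
        if s == h:
            return "unsure"
        return "spam" if s > h else "ham"


def train_on_error(filt, corpus, seed=0):
    """Present messages in random order; train only on wrong or unsure verdicts.

    Returns the number of corrections that were needed.
    """
    order = list(corpus)
    random.Random(seed).shuffle(order)
    corrections = 0
    for tokens, is_spam in order:
        expected = "spam" if is_spam else "ham"
        if filt.classify(tokens) != expected:
            # Disagreement with the human label (or "unsure"): correct the filter.
            filt.train(tokens, is_spam)
            corrections += 1
    return corrections
```

With a trivially separable corpus the toy filter needs only one correction per class, after which every remaining message is classified correctly and left untrained. Real mail is far less clean, which is why the messages a filter keeps getting wrong deserve a second look.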


Results at this early stage were simply shocking. 
At the start of the run Bogofilter would tend to get a 
few messages wrong until it had a handful of each 
type of message in its wordlists, at which point it 
would immediately begin to get the vast majority of 
classifications correct, increasing dramatically and 
observably until it had seen several thousand mes- 
sages, at which point it would generally stabilize at 
around 95% accuracy. Inaccuracies were either false 
negatives or “unsures” in all but a handful of cases.
More often than not messages which Bogofilter persis- 
tently had trouble classifying were re-evaluated to find 
that the human administrator had gotten them wrong 
in the first place and Bogofilter was correcting us. 


Once we had performed several of these test runs 
we were confident Bogofilter could be utilized as a 
sorting tool. We created wordlists based on all the 
messages sorted so far and then ran Bogofilter against 
the remaining unsorted lists, allowing it to sort them 
into provisional groups. These groups were then fur- 
ther evaluated manually by an administrator to verify 
the accuracy of their sorting before they were added to 
the global collections. 


Eventually, nearly 20,000 messages were 
removed due to excessive ambiguity about their nature. 
These included news updates, potentially legitimate
commercial mailings, religious and inspirational mes- 
sages, joke-of-the-day lists, and others. We decided it 
would be better to move forward with these messages 
unclassified and allow users to give more feedback 
during the pilot phase of the process. 


This sorting process took both administrators the 
better part of a week to complete, and we became inti- 
mately familiar with the character of the spam we 
were receiving. This was very tedious work, but there 
is no doubt taking the time to do this was a key to the 
current success of our implementation. 


Configuration and Tuning 


Once the initial message sorting is complete, the 
filter must be configured and tuned for its environment. 
There are several variables to consider, along with their 
interaction, and we attempted to use appropriate values
for each during each stage of the training process. On 
some occasions we did test runs using various combina- 
tions of settings to see which provided the best results. 
The theories involved were still being formed and 
actively debated, so we were experimenting pragmati- 
cally to find the best settings, but in most cases the theo- 
retical and empirical work that has been published since 
agrees with our results. On a related note, even though 
Bogofilter at the time shipped with tools to perform all 
of this training and tuning, these proved too nascent and 
incomplete to meet our needs, so we developed our own. 
Current versions of Bogofilter ship with complete train- 
ing tools which provide even more rigorous tuning func- 
tionality than what is described here. We are also deeply 
indebted to Greg Louis for his excellent tuning docu- 
mentation [LOU]. 

Figure 3: Screenshot of the beginning and end of a preliminary Bogofilter classification run of our sample mail corpora.

As noted above, Bogofilter’s primary configuration
variables in tri-state classification mode are
robs, robx, min_dev, spam_cutoff, and
ham_cutoff. spam_cutoff and ham_cutoff default
to 0.95 and 0.10, respectively. For the sorting process
above we reduced ham_cutoff to 0.05 to require a
higher level of certainty from Bogofilter before it was
allowed to flag a message for the legitimate corpus.
For the training process we also used 0.05 to inflate
the number of legitimate mails that would be classified
as “unsure” and therefore used for training during
the “train on error” process. These were additional
safeguards to ensure any biases introduced during
training would be on the side of blocking too little
mail instead of too much.
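
The effect of the two cutoffs can be illustrated with a minimal Python sketch. It is not Bogofilter's code; the function name and the exact boundary handling (inclusive comparisons) are assumptions made for illustration. Only the cutoff values come from the text.

```python
def classify(spamicity, spam_cutoff=0.95, ham_cutoff=0.10):
    """Map a spamicity score in [0, 1] to a tri-state verdict.

    Hypothetical helper: inclusive comparisons are an assumption.
    """
    if spamicity >= spam_cutoff:
        return "Spam"
    if spamicity <= ham_cutoff:
        return "Ham"
    return "Unsure"


# The same borderline-clean score under the default cutoff and under the
# stricter ham_cutoff of 0.05 used for sorting and training.
print(classify(0.07))
print(classify(0.07, ham_cutoff=0.05))
```

Lowering ham_cutoff widens the “unsure” band from below, so fewer messages are accepted as legitimate without review.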


To determine appropriate values for spam_cutoff
and robs, we ran several iterations of a
training script using a methodology similar to
recommendations published by Greg Louis [LOU2, LOU3].
We divided our pre-sorted message collections into 
three randomized groupings. Group A had 20,000 
messages (10,000 each of spam and non-spam). Group 
B had 10,000 messages (5,000 each of spam and non- 
spam). Group C had all remaining messages (approxi- 
mately 45,000). Bogofilter was fully trained on each 
message in Group A, then trained on error for each 
message in Group C. Group B was then used as a test 
corpus, with no training done and errors tabulated. The 
train-on-error and test runs were repeated once. Follow- 
ing this, Group B was used to further train on error, 
then again used to test; this process was also repeated 
once. In this fashion Bogofilter was either fully trained 
or twice trained on error for every message in our cor- 
pora. See Appendix A for more detail. 
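
The building blocks of this procedure (a randomized three-way split, training on error, and an error-counting test pass) can be sketched in Python as follows. This is a simplified illustration, not the script from Appendix A. The function names and the ToyFilter class are hypothetical, and ToyFilter merely remembers exact messages rather than scoring tokens.

```python
import random


def split_groups(spam, ham, a_each=10_000, b_each=5_000, seed=0):
    """Shuffle each corpus, then cut it into Groups A, B and C.

    Returns three lists of (message, is_spam) pairs.
    """
    rng = random.Random(seed)
    spam, ham = list(spam), list(ham)
    rng.shuffle(spam)
    rng.shuffle(ham)

    def cut(lo, hi):
        group = [(m, True) for m in spam[lo:hi]]
        group += [(m, False) for m in ham[lo:hi]]
        rng.shuffle(group)
        return group

    return cut(0, a_each), cut(a_each, a_each + b_each), cut(a_each + b_each, None)


def train_on_error(classify, train, group):
    """Train only on messages the filter gets wrong or is unsure about."""
    corrections = 0
    for message, is_spam in group:
        expected = "Spam" if is_spam else "Ham"
        if classify(message) != expected:
            train(message, is_spam)
            corrections += 1
    return corrections


def count_errors(classify, group):
    """Test pass: tabulate errors without doing any training."""
    return sum(classify(m) != ("Spam" if s else "Ham") for m, s in group)


class ToyFilter:
    """Stand-in for Bogofilter that remembers exact messages (illustration only)."""

    def __init__(self):
        self.seen = {}

    def train(self, message, is_spam):
        self.seen[message] = "Spam" if is_spam else "Ham"

    def classify(self, message):
        return self.seen.get(message, "Unsure")


# Tiny corpora so the example runs instantly; the paper used far larger groups.
spam = [f"spam-{i}" for i in range(40)]
ham = [f"ham-{i}" for i in range(40)]
group_a, group_b, group_c = split_groups(spam, ham, a_each=10, b_each=5)

toy = ToyFilter()
for message, is_spam in group_a:          # full training on Group A
    toy.train(message, is_spam)
first_pass = train_on_error(toy.classify, toy.train, group_c)
second_pass = train_on_error(toy.classify, toy.train, group_c)
test_errors = count_errors(toy.classify, group_b)
print(first_pass, second_pass, test_errors)
```

Because the toy filter cannot generalize, every Group B message counts as an error here. A real statistical filter is expected to get most of Group B right after seeing Groups A and C.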


We needed to determine if we should use the
default spam_cutoff of 0.95 or if we could safely
use a more aggressive value of 0.90. We also needed
to determine if we should use a robs value of 0.01 or
a more conservative 0.001 (see footnote 2). We therefore ran the
above test suite for each of the four combinations of
these two values. While each of the four gave nearly
identical results by the final two test rounds, the 0.01 
and 0.90 combination had the highest accuracy during 
the initial rounds and was therefore selected (Figure 
4). It was a pleasant surprise that using a lower 
spam_cutoff actually improved accuracy; appar- 
ently a statistically significant amount of our spam 
scored between 0.90 and 0.95, and using the lower 
cutoff meant more of these were classified correctly 
on the first pass. 


For robx, we again cleared our wordlists and 
ran several iterations of a similar training script, 
beginning with fully training on 10,000 each of spam 
and non-spam, then training on error for the rest of the 
messages. At the end of each iteration we took the 
final robx value and used it as the initial robx value 
for the next iteration. We did this several times, until 
the robx value stabilized at 0.477112. 
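
For readers unfamiliar with these two parameters: in Gary Robinson's published smoothing formula, robs (s) and robx (x) adjust a token's raw spam probability p(w) according to how often the token has been seen (n), as f(w) = (s*x + n*p(w)) / (s + n). The Python sketch below only illustrates that formula. It is not taken from Bogofilter's source, and the function name is hypothetical.

```python
def robinson_f(p_w, n, robs=0.01, robx=0.477112):
    """Robinson's smoothed token probability f(w).

    p_w  -- raw probability that a message containing the token is spam
    n    -- number of times the token has been seen in training
    robs -- weight given to the prior (s in Robinson's notation)
    robx -- prior assumed for a token never seen before (x)
    """
    return (robs * robx + n * p_w) / (robs + n)


# A token never seen before falls back to robx.
print(robinson_f(0.0, 0))
# A token seen three times, only in spam: a smaller robs trusts the
# sparse evidence more and pushes f(w) closer to 1.0.
print(robinson_f(1.0, 3, robs=0.01))
print(robinson_f(1.0, 3, robs=0.001))
```

This shows why very small robs values are risky: a token seen only a handful of times in one wordlist is scored almost as if the evidence were conclusive.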


Footnote 2: At the time there was some debate about whether overly
conservative robs values might cause unpredictable results.
Louis has since conclusively determined that values lower
than 0.01 do cause problems if a token occurs a few times in
one wordlist but not at all in the other [LOU4].


Figure 4: Results of varying the values of robs and spam_cutoff.


The final value is min_dev. We left this at the
default of 0.1 throughout training and into production.
The Bogofilter tuning documentation recommends
moving this to somewhere between 0.3 and 0.46 to
take into account fewer words per message, but testing
on our part showed that this reduced accuracy, and
recent experiments with raising the value to 0.3
resulted in an immediate and dramatic increase in the
amount of spam that made it through the filter. However,
we will likely need to revisit this as wordlist
attacks become more focused.
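
To make the role of min_dev concrete: tokens whose score lies within min_dev of the neutral value 0.5 are ignored when a message is scored. The following Python sketch shows how raising the threshold shrinks the set of tokens that contribute. The function name, the inclusive comparison, and the token scores are invented for illustration.

```python
def significant_tokens(token_scores, min_dev=0.1):
    """Keep only tokens whose score deviates from 0.5 by at least min_dev."""
    return {t: f for t, f in token_scores.items() if abs(f - 0.5) >= min_dev}


# Hypothetical per-token scores for one short message.
scores = {"meeting": 0.12, "the": 0.52, "offer": 0.68, "viagra": 0.99}

print(sorted(significant_tokens(scores, min_dev=0.1)))
print(sorted(significant_tokens(scores, min_dev=0.3)))
```

With the higher threshold, mildly spammy tokens such as “offer” no longer count, which is one plausible way for more spam to slip under spam_cutoff.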


At this point the tuning was considered complete.
Apart from raising the ham_cutoff back to
0.10 (to avoid an excessive number of “unsure” classifications),
these were the wordlists and configuration
we took to production (there are other options such as
how the lexer should tokenize HTML tags and IP 
information; we have left these at the defaults). 
Throughout the tuning process we remained amazed at 
the speed with which Bogofilter gained accuracy dur- 
ing the training runs, and concerns that using a single 
set of wordlists for an organization of this size would 
lead to significant misclassifications appeared 
unfounded. Although we tried to keep in mind that our 
results were preliminary, saying that we had signifi- 
cant remaining doubts by the end of the training itera- 
tions would be a mischaracterization. 


Design 


As noted above, our border mail architecture 
consists of a pair of mail exchangers running qmail. 
Prior filtering attempts ran as mirrored installs on both 
of these servers. Bogofilter, however, required some 
method of maintaining the same wordlists on both
servers and any future servers added for expansion 
purposes. We also needed a central location to receive 
end-user misclassification notices so that an admin- 
istrator could process them and train Bogofilter on fil- 
tering errors. We knew this central host would likely 
need to be able to perform extensive training and other 
processing operations which we would not want to 
impact our general mail load. None of this processing, 
however, needed to happen in real time for the mail to 
be properly filtered. We therefore added a new server 
to provide for general spam processing. When filter 
corrections are necessary, the end user forwards the 
misclassified message to this server, where an admin- 
istrator verifies the request and trains Bogofilter, updat- 
ing the central copy of the wordlists which are stored 
on this server. These wordlists are updated on the mail 
exchangers nightly. Since the mail exchangers have 
their own copies of the wordlists, there is no effect on
mail flow if this server is down or otherwise unreach- 
able. Finally, the spam server maintains data on all the 
training operations that have occurred so that the 
wordlists can be recreated from scratch as required. 


The actual filtering happens at the mail exchang- 
ers during the SMTP exchange. We use the qmail-qfil- 
ter wrapper around qmail-queue to provide real-time 
message filtering options. Our qfilter invocation first 
notes the mail in the filtering logs, then pipes the mes- 
sage through Bogofilter, then copies the mail to the 
spam server cache, then checks the Bogofilter-generated
spamicity header to determine whether or not the
message is spam (Bogofilter is also able to indicate
message status with an exit code, but passthrough
mode and header-based filtering proved more compatible
with the qfilter and spam caching server implementation).
If the message is spam, qfilter exits 31,
which causes qmail to refuse to accept the message 
from the originating server. See Appendix B. 
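
The header check at the end of this pipeline can be sketched in Python as follows. This is not the script from Appendix B. The function name, the regular expression, the fail-open behaviour when no header is found, and the use of exit status 0 for acceptance are all assumptions; only the exit status 31 and the X-Bogosity spamicity field come from the paper.

```python
import re
from email import message_from_string

REJECT_EXIT = 31  # per the text: causes qmail to refuse the message
ACCEPT_EXIT = 0   # assumption: zero lets delivery continue


def exit_code_for(raw_message, spam_cutoff=0.90):
    """Decide the filter's exit status from the X-Bogosity header."""
    headers = message_from_string(raw_message)
    bogosity = headers.get("X-Bogosity", "")
    match = re.search(r"spamicity=([0-9]*\.?[0-9]+)", bogosity)
    if match is None:
        # No usable header: fail open rather than block mail (assumption).
        return ACCEPT_EXIT
    score = float(match.group(1))
    return REJECT_EXIT if score >= spam_cutoff else ACCEPT_EXIT


def sample(verdict, spamicity):
    """Build a tiny test message; the verdict word is not relied upon."""
    return (
        f"X-Bogosity: {verdict}, tests=bogofilter, spamicity={spamicity}\n"
        "Subject: example\n"
        "\n"
        "body text\n"
    )


print(exit_code_for(sample("Unsure", "0.846170")))
print(exit_code_for(sample("Spam", "0.999000")))
```

A real filter would read the message from standard input and pass the result to sys.exit(). Parsing the numeric score rather than the verdict word keeps the check independent of how a given Bogofilter version labels spam.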


The solution of refusing to accept spam mail dur- 
ing the SMTP exchange has become somewhat popu- 
lar because it dodges the problems of network conges- 
tion and server load created by attempting to send 
bounce messages to senders that do not exist. We also 
chose it because in the case of false positives it lets the 
legitimate sender know immediately that their mes- 
sage was refused, allowing them to follow up quickly 
with their intended recipient. External relay servers 
may in theory still create unnecessary bounces or 
cause delays in receipt of legitimate bounces, but this 
was determined to be the best fit for our environment. However,
if spammers begin to train “evil” Bayesian filters
to attack “good” Bayesian filters (as John
Graham-Cumming has suggested [JGC]), we will need to
reconsider any implementation aspects such as this 
one which indicate to spammers the real delivery sta- 
tus of their messages. 


To create our ongoing training framework we 
augmented the scripts we had written for the initial 
wordlist creation. Mails users submit for correction are 
viewed on the spam server by the administrators either 
interactively in Mutt or using a custom script that iter- 
ates across the entire queue. In either case, the admin- 
istrator is presented with the mail headers and informa- 
tion on how Bogofilter classified the message initially 
and how it is currently classified (Figure 5). In some 
cases Bogofilter will have changed the way it classifies 
a given mail based on prior corrections, and the mes- 
sage can be skipped. If Bogofilter still misclassifies the 
message, the administrator can view the message head- 
ers and body and provide correction as required. The 
training has been deliberately kept as a manual process 
to prevent user error from corrupting the wordlists. We 
also do not use Bogofilter’s auto-update (-u) switch, as 
this is likely to introduce a significant amount of error 
in a high-volume environment. 


While testing this configuration, the primary 
issue that had to be resolved related to the fact that the 
users were forwarding messages for correction after 
Exchange had delivered them. Exchange modifies the 
message headers (and in some cases the body) signifi- 
cantly, which means that the messages users forward 
us are not the same messages Bogofilter originally 
classified and are useless for training. The best solu- 
tion to this seemed to be to cache the mails as Bogofil- 
ter originally saw them and look them up as required. 
The mail exchangers therefore forward a copy of all 
incoming messages to the spam server, where they are 
cached for a period of two weeks. The training lookup 
scripts take the mails users forward, extract several
pieces of header data, attempt to reconstruct the origi- 
nal headers, and look up the original message in the 
cache using these headers. Message headers are 
cached parallel to the full bodies to speed up lookups. 
This is quite frankly the least elegant piece of this 
entire filtering system, but no other option has been 
discovered for easily dealing with Exchange’s header- 
mangling problem. 
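
The idea of a header index kept alongside the cached messages can be sketched in Python as below. The paper does not say which header fields are used for the lookup, so the choice of Message-ID, sender, and subject is purely an assumption. The class and method names are hypothetical. Only the two-week retention period comes from the text.

```python
from datetime import datetime, timedelta

RETENTION = timedelta(days=14)  # two-week cache, per the text


class HeaderCache:
    """Toy index from normalized header fields to a cached message path."""

    def __init__(self):
        self._index = {}  # key -> (path, received_at)

    @staticmethod
    def _key(message_id, sender, subject):
        # Normalize case and whitespace, since forwarding may alter both.
        return (
            message_id.strip().lower(),
            sender.strip().lower(),
            " ".join(subject.split()),
        )

    def add(self, message_id, sender, subject, path, received_at):
        self._index[self._key(message_id, sender, subject)] = (path, received_at)

    def lookup(self, message_id, sender, subject):
        entry = self._index.get(self._key(message_id, sender, subject))
        return entry[0] if entry else None

    def expire(self, now):
        """Drop entries older than the retention period; return how many."""
        stale = [k for k, (_, ts) in self._index.items() if now - ts > RETENTION]
        for key in stale:
            del self._index[key]
        return len(stale)


cache = HeaderCache()
received = datetime(2004, 6, 18, 2, 0)
cache.add("<abc@example.com>", "Sender@Example.com", "Great  offer",
          "/cache/0001", received)
print(cache.lookup(" <ABC@example.com>", "sender@example.com", "Great offer"))
print(cache.expire(received + timedelta(days=15)))
```

An in-memory dictionary stands in for whatever on-disk format is actually used; the point is only that lookups touch the small header index rather than the full message bodies.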


See also Greg Louis’ article “Is Bogofilter Scalable?”
[LOU5], which was written at approximately
the same time as we were developing this implementation,
and discusses many of the same issues. While
Louis ends with uncertainty that the implementation 
he describes would scale enough to be able to handle 
an environment 10 or 50 times the one he describes 
(3,500 or 17,500 spam messages per day, respec- 
tively), our implementation demonstrates that a similar 
setup can scale to at least 75 times the spam volume 
he describes. We are not verifying all of our classifica- 
tions as rigorously as he suggests, but so far this has 
not created a problem. 


Implementation 


The Bogofilter roll out to production was to 
occur in four phases. Phase 0 was a two-week trial by 
the CIO for his mail only, primarily so that he could 
verify no mail would actually be blocked until further 
testing had been done. Phase 1 was a one-month
user-interactive pilot with the entire MIS department
(approximately 80 users) participating. Phase 2 was a 
three-month user-interactive period with the entire 
company asked to participate. Phase 3 implemented 
Bogofilter in a decision-making capacity such that it 
began actively preventing the delivery of messages it 
classified as spam. 


For Phases 0, 1, and 2, Bogofilter was imple- 
mented in passthrough mode alongside the existing 
Vipul’s Razor deployment. Any mails that Vipul’s did 
not block were filtered through Bogofilter, and a 
header was added with a classification for the “spam- 
icity” of the message in numeric form. Outlook rules 
were created and distributed to the pilot groups to sort 
mails classified as spam and unsure into their own 
folders. Users were asked to review the contents of 
those folders on a periodic basis. Any legitimate mails 
found in the spam or unsure folders were to be for- 
warded to one address, and any spam messages found 
in the inbox or unsure folders were to be forwarded to 
another address. Administrators monitored those 
addresses and corrected Bogofilter on mails that had 
been classified incorrectly as described in the 
“Design” section above.


Figure 5: Screenshot of a spam message being displayed in our training tool.


Some small training complications resulted from
our attempt to maintain a single, company-wide set of
wordlists. In reality, there is a large grey area between
“spam” and “non-spam” with a user population of 
this size. “Joke of the day” lists, airline/hotel advertising,
and religious messages are some examples.
Worse, end-user collisions were initially frequent, 
where two users found the same piece of bulk mail in 
their unsure folders and one would report it as spam 
while the other reported it as non-spam. In an extreme 
case one user reported mail from the same home- 
improvement list as either spam or not spam, based on 
the home-improvement advice in question (roofing 
advice was spam, gardening advice was not). These 
were generally resolved by the administrators on a 
case-by-case basis, usually by either letting mob-rule 
prevail or allowing Bogofilter to decide based on the 
content of the mail in question. In the case of the 
home-improvement list, the administrators did as the 
user instructed, and Bogofilter adjusted accordingly. 


Despite these minor complications, the pilot 
phases were an unqualified success. All but a handful 
of users reported all the spam gone from their inbox 
into either the spam or unsure folders. Though an 
implementation goal was to avoid user interaction and
the final production system primarily does that, users 
seemed to appreciate the brief opportunity to do some- 
thing about the spam that was reaching their inboxes. 
Most importantly, users were able to observe the filter 
in action and confirm first-hand that legitimate mail was 
not being misclassified; even during the pilot phases, no 
legitimate business-related mail was reported as mis- 
classified by the filter. Finally, the administrators were 
confident at the conclusion of the pilot period that the 
maintenance and ongoing training framework was scal- 
able and practical for the task at hand. 


Once all parties were satisfied with the accuracy 
of the solution, the mail exchangers were reconfigured 
to use Bogofilter’s spam classifications to block mail. 


Results and Observations 


Since its deployment in a blocking capacity, 
Bogofilter has continued to provide unprecedented 
accuracy and efficiency in mail filtering. Since the 
beginning of our spam mitigation effort, incoming 
email volume has risen from 900,000 to 1.4 million 
messages per month. Bogofilter has consistently 
blocked 60-75% of this incoming mail as spam, or
between 700,000 and 1 million messages on average
per month. The filter averages 1,100 blocked mails per
hour at the exchangers; when the average number of
recipients per blocked mail is factored in, this is
roughly 40 spams per user per work day. While it has
been difficult to quantify exactly how much spam still
enters our environment, based on end-user reports,
mail logs, and manual inspection we believe the filter
is 98-99% effective. This performance has persisted
without fail for more than a year (Figures 6 and 7).


Figure 6: Percentage of inbound mail/Percentage of inbound mail blocked as spam, October 2002-October 2003.
Bogofilter was installed in late April 2003. As we refused delivery of spam, we bounced less mail. As we
bounced less mail, the outbound percentage dropped, and inbound percentages increased accordingly.


Figure 7: Overall mail traffic, August 2003-April 2004. Spam is multiplied by negative 1, to make the graph more
readable. Data gaps are caused by log rotations.


Further, in that time there have been only five
blocks reported as false positives, only two of which
were legitimate filter misclassifications and business
related. One of these false positives involved a user
mistakenly reporting legitimate mail as spam, then
having similar followup messages blocked as spam.
Another involved a group of users reporting a group of
related mails as spam, then having a new user attempt
to receive mail from that same source and having it
blocked. A third was a user who had personal mail
regarding a PayPal transaction blocked; as personal
PayPal mails are not a common occurrence at our
company, most of Bogofilter’s opinion of PayPal
tokens had been derived from spam categorization of
PayPal phishing scam mails. All of these were
resolved through either correcting the previous
misclassifications or training on the blocked mail.


The two remaining misclassified messages were
real false positives. The first was sent from a vendor
providing an employee reward program. Though this
was a legitimate business mail, its content was virtually
indistinguishable from spam (e.g., “someone has
sent you $xx.xx, click here to redeem it”). The other
false positive was a bulk employee satisfaction survey
sent from an external source. Some recipients received
this mail fine, while others had it blocked. Inspection
revealed that for some reason some of the mails had
been sent with both Spanish and English language
parts, while the rest were English only. The English
mails got through, while the Spanish mails were
blocked because those tokens had previously been
encountered primarily in spam messages.


Since going to production we have not experienced
any of the predicted downsides of using a single
set of wordlists with centralized administration. It is
possible that our user population’s legitimate mail is
more homogeneous than that of an average organization,
but this seems unlikely. Rather, expectations that
the filter’s accuracy would “fall apart” when the same
wordlists were used for many users seem simply not to
have been borne out in practice. While even a success
rate of 99% is an order of magnitude lower than the
99.9% and higher rates individual users of Bayesian
filters have achieved, this does not necessarily indicate
an inherent loss of discrimination ability due to shared
wordlists. It could instead be the result of only training
on the filtering errors users report instead of on all 
errors. In any case, blocking 99% of our spam is more 
than sufficient to meet the needs of an organization such 
as ours, and much better than any other solution we 
have evaluated or considered. The other minor wordlist- 
sharing issues encountered during the pilot of some 
users reporting mail as spam and other users reporting 
the same mail as non-spam seem to have resolved them- 
selves; most likely those mails are accurately classified 
as spam, and the users who reportedly wanted to receive 
them are not actually noticing when they are blocked. 
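
The volume figures quoted in this section can be cross-checked with some simple arithmetic, sketched here in Python. The 30-day month and the assumption that the blocked volume corresponds to a 99% catch rate are simplifications introduced for the calculation. The hourly rate and the percentages come from the text.

```python
blocked_per_hour = 1_100            # from the text
hours_per_month = 24 * 30           # assumption: 30-day month
blocked_per_month = blocked_per_hour * hours_per_month

# If the blocked mail is 99% of all spam, estimate the total spam volume
# and compare what leaks through at 99% versus 99.9% effectiveness.
total_spam = blocked_per_month / 0.99
missed_at_99 = total_spam * (1 - 0.99)
missed_at_999 = total_spam * (1 - 0.999)

print(blocked_per_month)
print(round(missed_at_99), round(missed_at_999))
```

The monthly total falls inside the 700,000 to 1 million range reported above. The tenfold gap between the two leak-through figures is what “an order of magnitude” refers to: the comparison concerns the error rate, not the success rate.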


The effect of our blocking success on the end- 
user population has been dramatic and overwhelm- 
ingly positive, and the spam problem is solved from 
their perspective. Business has been able to continue 
without significant interruption. The only user com- 
plaints involved the minor number of misclassified 
mails referenced above. One of the more obvious 
signs of our success is that many of our users once 
again consider it odd to see more than a few spams a 
week in their inboxes, and will even call the support 
desk to complain when this happens. They continue to 
appreciate their ability to take some control of the 
problem when spams do reach them, and many speak 
of the filter in terms that make it obvious they are able 
to observe — and have even come to expect — nearly 
immediate results when they report new spams that 
have started getting into their inboxes and Bogofilter 
is trained on them. Periodically we are even contacted 
by other companies who have received word-of-mouth 
reports of our success and are interested in the details 
of our implementation; our users are apparently satis- 
fied enough to mention our success when their con- 
tacts complain about spam. 


Our ongoing training framework has also worked 
much better than we anticipated. In theory we should 
be monitoring all mails that are classified as “unsure”
and using these to train Bogofilter. This would be a 
significant amount of effort, however, so we have 
instead limited our ongoing training to spam messages 
users report. Usually several users will report the same 
message as spam. Training on one message will gener- 
ally be sufficient to update Bogofilter’s opinion of 
duplicate or similar messages, so the training scripts 
re-examine each message before it is presented to the 
administrator. If Bogofilter correctly classifies the 
message on re-examination it is automatically skipped. 
On average we only need to train on 15-20 unique 
messages per day before Bogofilter has “caught up”
and the rest of the training queue can be skipped. 
Using this methodology one administrator is able to 
process all the incoming requests in four hours per 
week. The continued success of the filter indicates that 
at least for the time being this training is sufficient. 
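
The skip-on-re-examination behaviour described above can be sketched in Python as follows. The TokenFilter class is a deliberately crude, hypothetical stand-in for Bogofilter that flags any message sharing a trained spam token, which would be far too aggressive in practice. The function name and sample messages are likewise invented for illustration.

```python
class TokenFilter:
    """Toy filter: a message is spam if it shares any trained spam token."""

    def __init__(self):
        self.spam_tokens = set()

    def train_spam(self, message):
        self.spam_tokens.update(message.lower().split())

    def is_spam(self, message):
        return any(tok in self.spam_tokens for tok in message.lower().split())


def process_queue(filt, reported):
    """Re-examine each reported message; train only if still misclassified."""
    trained, skipped = [], 0
    for message in reported:
        if filt.is_spam(message):
            skipped += 1          # the filter has already caught up
            continue
        # In production an administrator reviews the message at this point.
        filt.train_spam(message)
        trained.append(message)
    return trained, skipped


queue = [
    "cheap pills online",
    "cheap pills online",        # duplicate report from a second user
    "pills discount today",      # similar message
    "refinance mortgage now",
    "refinance mortgage now",
]
trained, skipped = process_queue(TokenFilter(), queue)
print(len(trained), skipped)
```

Only the first message of each family needs attention; duplicates and near-duplicates are skipped once the filter has been corrected.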


So far this solution has also proved entirely scal- 
able, with no obvious ceiling in sight. Mail exchangers 
can easily be added to support growing mail volume, 
with each only needing a Bogofilter installation, a 
copy of the configuration files, and the ability to pull 
the most recent copy of the wordlists from the spam 
server once per day. The system meets the goal of 
being efficient enough to operate in real time without 
affecting our mail-flow capacity; real-time Bogofilter 
evaluation of each message costs almost nothing and 
has not noticeably affected the system load or message 
processing time (Figure 8). The initial wordlists were 
31 MB combined; that has only grown to 40 MB. The 
spam server is a potential bottleneck both in the need 
to receive a copy of every incoming message and the 
ability to lookup cached messages for training in real 
time, but the single server in place today maintains a 
CPU load average of 0.1, so this is not any kind of 
immediate concern. If message lookups become a 





% ls -s -h random_words.txt
12M random_words.txt


% < random_words.txt rl | head -c 10240 > words_10K.txt
% < random_words.txt rl | head -c 102400 > words_100K.txt
% < random_words.txt rl | head -c 1024000 > words_1000K.txt
% < random_words.txt rl | head -c 10240000 > words_10000K.txt


% /usr/bin/time sh -c '< words_10K.txt bogofilter'
0.06user 0.03system 0:00.08elapsed 101%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (496major+291minor)pagefaults 0swaps


% /usr/bin/time sh -c '< words_100K.txt bogofilter'
0.36user 0.03system 0:00.38elapsed 101%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (498major+440minor)pagefaults 0swaps


% /usr/bin/time sh -c '< words_1000K.txt bogofilter'
1.51user 0.10system 0:02.52elapsed 63%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (497major+978minor)pagefaults 0swaps


% /usr/bin/time sh -c '< words_10000K.txt bogofilter'
5.74user 0.14system 0:05.88elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (499major+1033minor)pagefaults 0swaps


Figure 8: Run times of Bogofilter for various message sizes, using our production wordlists. These times are
relative to a dual-CPU 700MHz PIII, the least-powered server in this architecture.
bottleneck we can easily convert the header lookup to
a faster database format or something similar. Extremely
large organizations may find keeping up with the
ongoing training difficult, as four hours per week for this
environment is potentially a full-time position for an
organization with 10 times our mail volume. This can
be further automated, however, using methods similar
to those we describe in “Future Work” below.


As with any other filtering solution, we have 
observed spammers attempting to adjust their tech- 
niques to get around this filter. However, we have not 
seen them succeed, as evidenced by our ability to sus- 
tain a 98-99% block rate for more than a year. Initial 
attempts based on weaknesses and bugs in particular 
early implementations of Bayesian filters ceased to be 
effective long ago. Wordlist poisoning, token obfusca- 
tion, token dilution, and microspam attacks abound, 
but have had no significant effect on our ability to 
detect spam messages. This is despite a popular belief 
that they collectively represent an Achilles’ heel of 
Bayesian filters [EMM, JdeBP2]. As Graham origi- 
nally predicted, and he and others have since repeated 
[GRA2, JGC2], the attackers either completely fail to 
understand the nature of the system they are attacking, 
or the Bayesian filters simply adapt. Where wordlist 
attacks have had any success at all it is probable that 
Effective Size Factor and Bayesian Noise Reduction 
techniques will be effective in mitigating their value, 
as has already been demonstrated elsewhere [ROB2, 
ZDZ]. Microspams, which contain only a URL and a 
handful of words, show the most current potential for 
giving the filter trouble; however, each variant works 
at most once, since the URLs and the unique message 
headers immediately become high spam indicators, 
and the lack of other information provides no opportu- 
nity for non-spam tokens which could lower the spam- 
icity of the message. In any case this is at best a tem- 
porary advantage for spammers, as there are filter 
enhancements to deal with these as well; some of 
these are discussed in the next section. 


Future Work 


Although this implementation is functioning well 
beyond expectations and has shown no degradation of 
performance in the time it has been running, there is 
room for improvement, both to increase filtering effi- 
ciency and to stay ahead in the arms race. The point of 
adaptive filters of course is that they will automatically 
keep up as spam changes, but there are process efficien- 
cies and theory improvements to incorporate. There are 
also possible side-channel attacks we need to be wary of. 


First, we are planning an upgrade to a more 
recent version of Bogofilter. We implemented on ver- 
sion 0.13.0. While this has proved stable and reliable 
in our configuration, it is rather out of date. Newer 
versions promise increased functionality and configu- 
ration options. In addition there are lexer changes to 
support continuing work in the field of applying 
Bayesian filtering to spam, such as the way HTML 
messages are handled. Most importantly, new versions
of Bogofilter contain support for Effective Size Factor
(ESF), which is designed to account for loss of filter
discrimination introduced by naturally occurring
volatility in the data as a result of token redundancy
(see footnote 3).


In addition to upgrading Bogofilter, we would 
like to track the shifting nature of our spam more rig- 
orously by being more proactive in training on mes- 
sages which are misclassified or classified as 
“unsure.” As noted above, we currently rely primarily
on end users to forward us misclassifications; while 
this currently works, at the moment it is only obvious 
to end users when mails that should have been blocked 
arrive in their inboxes. It is not obvious to them when 
legitimate mails on certain topics are flagged as 
“unsure.” As the wordlists are populated with more
and more tokens that are only spam, we run the risk of 
introducing false positives. 


One proposed method of addressing this issue is to 
make it visually obvious to end users when messages 
are classified as “‘unsure” so that they can forward them 
to the appropriate training address as they see fit. We do 
not want to inconvenience them by asking them to 
install rules to filter mails into separate folders again, so 
we are investigating ways to modify the way the mes- 
sage displays in Outlook. One option is to use the “X- 
Message-Flag” header to provide a message indicating 
the mail is suspected to be spam; Outlook will present 
this to the user as an alert header above the message 
body. If users cooperate, this would allow us to train on 
a higher percentage of unsure mails (especially those 
which are non-spam) without requiring administrators 
to do the sorting manually. 


Footnote 3: Spammy tokens tend to appear in groups. For example, an 
email containing the token “mortgage” is more likely to 
contain the token “mortgage” again, or other related spammy 
tokens such as “refinance.” This token redundancy can 
have a negative impact on filter discrimination by artificially 
magnifying the combined probability disproportionately between 
large and small emails. The good news is that problems 
with token redundancy can be dealt with mathematically; 
Gary Robinson has evidence to suggest that applying an 
Effective Size Factor (ESF) improves filter discrimination 
significantly by reducing the impact of this redundancy 
[ROB2]. Greg Louis’ work confirms this [LOU6]. 


Note that token redundancy should not be confused 
with basic statistical independence, though the issues may 
appear similar. Confusion about this distinction has led to 
much ado about the violation of statistical independence in 
the context of Bayesian mail filtering because current approaches 
assume statistical independence of the input data 
where it does not actually exist. For example, training on error 
violates statistical independence by selecting messages 
for training based on prior examination, while the algorithms 
assume that training is done using a truly random sample of 
messages. Some have claimed this lack of actual independence 
represents a weakness in the use of Bayesian classifiers 
that is somehow “exploitable” [ALL]. Violations of assumptions 
of statistical independence are not news to probability 
theorists, however, since these assumptions are nearly 
always violated in practice [ROB3]. In addition, there is 
work to suggest that even the existence of the assumption of 
statistical independence has been overstated in the context of 
Bayesian classifiers [DOM]. 
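
The header-based flagging proposed above might be sketched as a small filter that reads a message on stdin and adds an X-Message-Flag header when Bogofilter has marked the message as unsure. This is only an illustration under stated assumptions: the function name flag_unsure and the flag text are hypothetical, and the "X-Bogosity: Unsure" prefix is assumed by analogy with the "X-Bogosity: Spam," prefix tested in Appendix B.

```shell
#!/bin/sh
# Hypothetical sketch: add an X-Message-Flag header to "unsure" messages.
# Reads a message on stdin, writes the (possibly flagged) message to stdout.
# $1 = text to show to the user (illustrative wording, not from the paper).
flag_unsure() {
    awk -v text="$1" '
        BEGIN { in_header = 1; unsure = 0 }
        # Only header lines count; the body may quote anything.
        in_header && /^X-Bogosity: Unsure/ { unsure = 1 }
        # The first empty line ends the header: add the flag just before it.
        in_header && /^$/ {
            if (unsure) print "X-Message-Flag: " text
            in_header = 0
        }
        { print }
    '
}
```

A filter of this shape could be called as `flag_unsure "Possible spam: please forward to the training address" < message`. Placing the new header at the end of the header block avoids splitting an existing folded header.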


However, we would prefer to rely on end-user 
interaction as little as possible. We intend to imple- 
ment tools that will allow us to test Bogofilter auto- 
matically on incoming mails when we can make rea- 
sonable guesses on the status of those messages inde- 
pendently. Probable spam or non-spam messages 
which Bogofilter apparently misclassifies can be 
flagged for administrator review, and the filter can be 
corrected if necessary. Any spam-blocking methods 
which provide useful categorization of some spam but 
are either too resource-intensive or have an unaccept- 
ably high false-negative rate for use in our production 
environment are good candidates for this kind of sec- 
ondary screening. One simple option we will likely 
implement is to create and advertise a low-priority 
mail exchanger tarpit; these take advantage of the fact 
that a fair amount of spam software attempts to target 
low-priority mail exchangers with the assumption they 
are less heavily guarded than primary exchangers. 
Mail sent to this tarpit would be flagged for review, 
with the assumption that it is all spam. Any mails 
which Bogofilter was unsure about would be good 
candidates for training. Another option is to reimple- 
ment tools like Vipul’s Razor on secondary hardware 
to scan a random sample of cached incoming mail. For 
legitimate mails, we will likely begin randomly sam- 
pling outgoing mail and flagging any messages which 
Bogofilter classifies as unsure or spam. Note that none 
of these tools would be added as active filtering mech- 
anisms, but they could be useful outside of the mail 
flow in automating the process of keeping Bogofilter’s 
wordlists up to date with the shifting nature of spam. 
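
One possible reading of the secondary screening described above is a small decision table that compares an independent guess about a message (based on where it came from) with Bogofilter's verdict. The function name review_action, the origin labels "tarpit" and "outgoing", and the action names are hypothetical; the verdict strings follow the ones used by rt.sh in Appendix A.

```shell
#!/bin/sh
# Hypothetical sketch: decide what to do with a sampled message.
# $1 = origin of the sample: "tarpit" (presumed spam) or "outgoing"
#      (presumed legitimate)
# $2 = Bogofilter verdict: Spam, Legit or Unsure
review_action() {
    case "$1:$2" in
        tarpit:Unsure)                 echo "train-candidate" ;;  # presumed spam the filter was unsure about
        tarpit:Legit)                  echo "review" ;;           # apparent false negative
        tarpit:Spam)                   echo "none" ;;             # filter agrees with the guess
        outgoing:Spam|outgoing:Unsure) echo "review" ;;           # apparent problem with legitimate mail
        outgoing:Legit)                echo "none" ;;
        *) echo "unknown origin or verdict: $1 $2" 1>&2; return 1 ;;
    esac
}
```

Nothing here filters mail; the output is only meant to feed an administrator's review queue, in keeping with the note that these tools stay outside the mail flow.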


We will also continue to tweak the filter options 
as appropriate, and may make some of them dynamic. 
Possible options here include automatically modifying 
the Bogofilter configurations on the mail exchangers 
based on time of day to filter mail more aggressively 
during high spam periods such as weekends and late 
night, or at least to flag unsure messages during these 
times for scanning. Similarly, if the frequency of 
microspams increases and these become a problem we 
may lower min_dev to catch more of them on the 
first pass; another option is to make min_dev (or 
other parameters) dynamic based on message size, 
e.g., set min_dev to 0 for messages less than one 
kilobyte in size. It is entirely possible, however, that 
the filter will simply adapt; in this scenario things like 
normal local SMTP headers would become high 
enough spam indicators that microspams would 
always be blocked, while both short and long legiti- 
mate messages would get through due to their legiti- 
mate tokens. In any case, both of these types of 
dynamic configuration would probably only be neces- 
sary temporarily; eventually Bayesian mail classifiers 
will likely reach the point of using meta data — such as 
message size, arrival time, number and type of attach- 
ments, etc. — as spam/non-spam tokens. 
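
The size-based variant mentioned above (min_dev set to 0 for messages under one kilobyte) could be sketched as follows. The function name min_dev_for_message is hypothetical, and the normal min_dev value is passed in by the caller because the production value is not stated here; how the chosen value reaches Bogofilter (for example through a generated bogofilter.cf) is left open.

```shell
#!/bin/sh
# Hypothetical sketch: pick a min_dev value based on message size.
# $1 = file containing the message
# $2 = min_dev to use for messages of normal size (caller-supplied)
min_dev_for_message() {
    # Arithmetic expansion strips any padding that wc may emit.
    size=$(( $(wc -c < "$1") ))
    if [ "$size" -lt 1024 ]; then
        echo 0          # microspam-sized message: consider every token
    else
        echo "$2"
    fi
}
```

For example, `min_dev_for_message /path/to/message 0.2` prints 0 for a 300-byte message and 0.2 for a 5 KB one; the 0.2 is a placeholder, not a recommended setting.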


In addition, we have done some investigation to 
see if we could use a lower base spam_cutoff and 
let even less spam through. Examining the classification 
of both the spams and legitimate mails that are 
being classified as “unsure,” it looks like we would 
be safe dropping this value to 0.85 or even lower. 
However, we will need to do much more rigorous testing 
before we make such a fundamental change. Ideally, 
we will also have the spot-checking of active mail 
described above and other measures in place to track any 
increase in the ongoing probability of false positives 
using this cutoff. 
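
A first, rough check of a candidate cutoff could reuse the logs written by the Appendix A scripts. The sketch below counts the logged messages of a given true class whose spamicity is at or above a candidate spam_cutoff; applied to NOTSPAM it estimates the false positives that cutoff would have produced on the test set. The function name count_at_cutoff is hypothetical, and the log layout ("path: CLASS: Verdict: spamicity=N") is assumed from the printf calls in rt.sh.

```shell
#!/bin/sh
# Hypothetical sketch: count messages of one true class that would be
# treated as spam at a candidate cutoff. Log lines are read from stdin.
# $1 = true class as logged (SPAM or NOTSPAM)
# $2 = candidate spam_cutoff, e.g. 0.85
count_at_cutoff() {
    awk -v real="$1" -v cutoff="$2" '
        {
            is_class = 0; have_score = 0
            for (i = 1; i <= NF; i++) {
                if ($i == (real ":")) is_class = 1
                if ($i ~ /^spamicity=/) {
                    score = substr($i, 11) + 0   # text after "spamicity="
                    have_score = 1
                }
            }
            if (is_class && have_score && score >= cutoff) n++
        }
        END { print n + 0 }
    '
}
```

Running `count_at_cutoff NOTSPAM 0.85 < tt.mytag-4.log` against a test log (the tag is an example) gives a quick number to compare between cutoffs, though it is no substitute for the more rigorous testing called for above.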


We would like to improve the caching system or, 
preferably, remove the need for it altogether. Storing 
emails for this kind of retrieval is not ideal. Recreating 
headers that Exchange has removed is inefficient at 
best and does not always work. At the least we would 
like to move to a system where each message can be 
flagged in a way that Exchange will not modify so that 
a simplified lookup will work, but this is not likely to 
be possible without adding something to the message 
body or requiring users to forward messages in a spe- 
cific format. Both of these options are at odds with 
business standards. Other Bayesian classifiers such as 
DSPAM have experimented with keeping a record in 
the database of which tokens were present in a given 
message, but it is questionable whether this would 
scale to an environment such as ours. 


Finally, we will need to reconsider rejecting 
spams during the SMTP exchange instead of silently 
dropping them. Rejection gives spammers information about 
the delivery status of their messages, which allows for 
the possibility of training an “evil” Bayesian filter on 
what messages our filter rejects or accepts. Given 
enough messages, this would allow spammers to craft 
messages specifically designed to pass through our filter. 
John Graham-Cumming has discussed this attack in 
detail [JGC]. However, this technique is easily defeated 
by simply not returning any information to spammers 
on the delivery status of their messages. This is 
non-ideal; it means that in the case of false positives 
the legitimate sender would not get an error to indicate 
their message was not received. However, the low 
false-positive rate demonstrated by this type of filtering 
likely justifies this inconvenience. 
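
The difference between the two policies can be reduced to the exit code returned by the final filter stage. The sketch below is an assumption-laden illustration, not our production script: the function name spam_exit_code and the policy names are hypothetical; exit code 31 is the refusal used in Appendix B; exit code 99 is, to our understanding of the qmail-qfilter documentation, the code that makes qmail-qfilter stop and report success without queueing the message, and should be verified against the installed version before use.

```shell
#!/bin/sh
# Hypothetical sketch: choose an exit code for the last filter stage.
# $1 = first line of the message (expected to be the X-Bogosity header)
# $2 = policy: "reject" (tell the sender) or "drop" (discard silently)
spam_exit_code() {
    case "$1" in
        "X-Bogosity: Spam,"*)
            if [ "$2" = "drop" ]; then
                echo 99     # assumed: silent discard, sender sees success
            else
                echo 31     # permanent refusal during the SMTP exchange
            fi
            ;;
        *)
            echo 0          # not spam: let the message through
            ;;
    esac
}
```

With the "drop" policy a spammer probing the filter sees the same response for every message, which is the property the defence described above relies on.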


Related Work 


At this time several other medium to large envi- 
ronments are known to be having success monitoring 
their mail flow with Bogofilter using single, central- 
ized wordlists: 

• York University in Toronto recently deployed 
  Bogofilter as a classifier for their environment of 
  60,000 user accounts. Incoming mail volume is 
  on the order of hundreds of thousands of messages 
  per day. At the time of writing, this implementation 
  was too new to have reliable numbers 
  available, but early results are promising. In this 
  implementation messages pass through Bogofilter 
  after DCC has already scanned them and 
  rejected spams it can detect; approximately 
  30-40% of this remaining incoming mail is initially 
  being flagged as spam by Bogofilter. This 
  is an order of magnitude higher than the block 
  rate of the SpamAssassin implementation that 
  Bogofilter replaced, with a dramatically lower 
  rate of false positives reported. 

• A large ISP in Australia is using a modified version 
  of Bogofilter with a single wordlist to 
  watch 150,000 mailboxes. Over 1 million messages 
  are processed per day. Bogofilter is 
  believed to be around 95% effective in this environment, 
  with no false positives reported in six 
  months of operation. The wordlist management 
  is completely centralized, with no user input 
  whatsoever. Administrators keep Bogofilter’s 
  training current by manually scanning and training 
  on random samplings of 100-300 “unsure” 
  emails per week. 

Both of these deployments have followed implementa- 
tion and maintenance methodologies similar to the 
ones described in this paper. 


Conclusion 


Spammers have shown extreme willingness to 
adapt using any means at their disposal to spread their 
messages. Assumptions that they will be unwilling or 
unable to expend considerable resources or break the 
law to attack their targets are at best unfounded. 
Blocking techniques aimed at the mail routes and pro- 
tocols are not directed at immutable properties of 
spam and therefore seem unlikely to succeed and are 
likely instead to drive spammers to more and more 
illicit methods and increasing collateral damage. Paul 
Graham was correct in noting that the content of spam 
is the one thing which cannot be changed arbitrarily; it 
can be altered and obfuscated, but at some point the 
content still must be there, or there is no potential for 
any return on the spammers’ investment. Content- 
based filtering is therefore the best currently available 
approach to the problem, and our results show that 
Bayesian filters can be viable for the long term as 
adaptive content filters. 


Our results demonstrate that, far from being just 
another failed attempt at producing a comprehensive 
and sustainable solution to the spam problem, 
Bayesian filters can be sufficient without aid from sec- 
ondary methods. This is true even in large environ- 
ments with central control, provided these filters are 
implemented carefully and with a solid understanding 
of the theory supporting them. There is a tremendous 
difference throughout the IT field between systems 
which truly cannot stand the test of time and those 
which are simply difficult or tedious to implement cor- 
rectly, and it is irresponsible to dismiss something 
with the latter property as though it had the former. 
We would all prefer it if we could block spam using 
easily implemented and foolproof methods, but if 
complicated solutions are required to win this fight, 
doing our homework seems the least we can do given 
the severity of the problem. Bayesian methods deserve 
further development, testing, and careful consideration 
before they are dismissed due to false assumptions or 
incomplete understandings of their value and effec- 
tiveness. Organizations and individuals are encour- 
aged to implement similar solutions to the one detailed 
here and attempt to duplicate and verify our results. 


Do we think Bayesian filters are a permanent 
solution to all spam? No. If nothing else, Moore’s 
Law dictates that brute force attacks will eventually be 
viable. There is no reason to believe, however, that 
spammers will bother waiting that long just to imple- 
ment high-cost attacks. Based on their previous pat- 
terns they are much more likely to attempt to sidestep 
the issue entirely and move to less direct methods: the 
spam equivalent of side-channel attacks. We are 
already seeing signs of these, as spammers are using 
browser malware to inject ads directly into web site 
content and are increasingly relying on worms to create 
zombie machines to send their spam for them. At the 
paranoid extreme, a fairly obvious convergence of these 
practices would be worms which inject ads directly into 
outgoing user emails, turning every legitimate mail into 
spam at the same time. The current malware epidemic 
demonstrates that such a methodology would be a 
much more cost-effective practice for spammers than 
escalating the filter war indefinitely. No filtering tech- 
nology that only aims to block spam messages would 
be useful in stopping this type of spam, since any mail 
blocked would by definition be a false positive. SMTP 
replacements and authentication schemes would be 
similarly useless, since these messages would be sent 
through valid mail routes using appropriate credentials 
just as many worms are today. If and when we reach 
this point, however, we will no longer even be dealing 
with unsolicited commercial and bulk email. We will be 
dealing with something else. Filtering may be the most 
effective method for dealing with spam that is simply 
commercial email messages, but malware is something 
completely different. 


Regardless of the possible future of spam, the 
most value offered by effective Bayesian filters is to 
be gained today while spammers are still unable to cir- 
cumvent them and have not yet determined their next 
method of attack. The war against spam will no doubt 
continue to be fought with a range of methods across 
various platforms. No single technology will likely 
emerge which is capable of dealing with all attacks 
equally well. If the next phase does center around mal- 
ware, it will only be won if we can get ahead of the 
spammers now and get a handle on the current virus 
epidemic. If Bayesian filters are at least robust enough 
to win the filtering battle for the foreseeable future 
they will be invaluable in buying administrators desperately 
needed time to move beyond filter maintenance 
to the much more serious problems of end-user 
computer security. Instead of carelessly dismissing 
these filters as already beaten and wasting time pursu- 
ing weaker solutions, administrators should take advan- 
tage of the opportunities they provide to go on the 
offensive and finally gain an advantage over spammers 
before it is too late. 


Availability 


Vipul’s Razor is available at 
http://razor.sourceforge.net/. 


Bogofilter is available at 
http://bogofilter.sourceforge.net/. 


qmail-qfilter is available at 
http://untroubled.org/qmail-qfilter/. 


Appendix A 


The scripts below are what we used to create our initial wordlists and tune Bogofilter. This functionality can 
now be found in the scripts that ship with Bogofilter, but these are provided for illustration purposes and in the hope 
that they may be useful. 


#!/bin/sh

# retrain.sh: Test various bogofilter.cf values. Create randomized message
# lists, fully train on 10k each spam and non-spam, train on error for other
# messages, test on 5k each spam and non-spam. Repeat until all messages are
# trained on error twice, then output stats for each test. Log everything.

# syntax: retrain.sh <logfiletag>

cd /usr/share/bogofilter/retrain/

echo "removing old wordlists..."
# Paths are spelled out because brace expansion is not available in every sh.
rm -f data/goodlist.db data/spamlist.db


if [ ! -f list ]; then
    echo "making lists..."
    ./makelists.sh 1>&2
fi


echo "doing 10k spam..."
# Register each message in the spam wordlist (-s).
while read -r REPLY
do
    echo "${REPLY}" 1>&2
    < "${REPLY}" ./bogofilter -C -c ./bogofilter.cf -s
done < 10k_spam_list.random


echo "doing 10k notspam..."
# Register each message in the non-spam wordlist (-n).
while read -r REPLY
do
    echo "${REPLY}" 1>&2
    < "${REPLY}" ./bogofilter -C -c ./bogofilter.cf -n
done < 10k_notspam_list.random


echo "doing rt round 1..."
./rt.sh rest_list.random | tee ./rt."${1}"-1.log | sed -f ./rt.sed 1>&2


echo "doing tt round 1..."
./tt.sh 5k_spam_list.random | tee ./tt."${1}"-1.log | sed -f ./rt.sed 1>&2
./tt.sh 5k_notspam_list.random | tee -a ./tt."${1}"-1.log | sed -f ./rt.sed 1>&2


echo "doing rt round 2..."
./rt.sh rest_list.random | tee ./rt."${1}"-2.log | sed -f ./rt.sed 1>&2


echo "doing tt round 2..."
./tt.sh 5k_spam_list.random | tee ./tt."${1}"-2.log | sed -f ./rt.sed 1>&2
./tt.sh 5k_notspam_list.random | tee -a ./tt."${1}"-2.log | sed -f ./rt.sed 1>&2


echo "adding tt messages round 1..."
./rt.sh 5k_list.random | tee ./rt."${1}"-3.log | sed -f ./rt.sed 1>&2


echo "doing tt round 3..."
./tt.sh 5k_spam_list.random | tee ./tt."${1}"-3.log | sed -f ./rt.sed 1>&2
./tt.sh 5k_notspam_list.random | tee -a ./tt."${1}"-3.log | sed -f ./rt.sed 1>&2


echo "adding tt messages round 2..."
./rt.sh 5k_list.random | tee ./rt."${1}"-4.log | sed -f ./rt.sed 1>&2


echo "doing tt round 4..."
./tt.sh 5k_spam_list.random | tee ./tt."${1}"-4.log | sed -f ./rt.sed 1>&2
./tt.sh 5k_notspam_list.random | tee -a ./tt."${1}"-4.log | sed -f ./rt.sed 1>&2


echo "doing stats..."
echo "1st run:"
./getstats.sh ./tt."${1}"-1.log
echo "2nd run:"
./getstats.sh ./tt."${1}"-2.log
echo "3rd run:"
./getstats.sh ./tt."${1}"-3.log
echo "4th run:"
./getstats.sh ./tt."${1}"-4.log


#!/bin/sh
# makelists.sh: Generate randomized message lists needed by retrain.sh.

# This script requires rl, found here:
# http://tiefighter.et.tudelft.nl/~arthur/rl1/

# Full list of corpus files, then the same list in random order.
find /var/spam/corpii/NOTSPAM /var/spam/corpii/SPAM -type f > list
< list rl > list.random

# First 10,000 of each class: initial full training.
< list.random grep -i '/corpii/spam' | head -n 10000 > 10k_spam_list.random
< list.random grep -i '/corpii/notspam' | head -n 10000 > 10k_notspam_list.random
cat 10k_spam_list.random 10k_notspam_list.random | rl > 10k_list.random

# Last 5,000 of each class: held back for testing.
< list.random grep -i '/corpii/spam' | tail -n 5000 > 5k_spam_list.random
< list.random grep -i '/corpii/notspam' | tail -n 5000 > 5k_notspam_list.random
cat 5k_spam_list.random 5k_notspam_list.random | rl > 5k_list.random

# Everything else: used for training on error.
sort 10k_list.random 5k_list.random | grep -v -F -f - list.random | \
    rl > rest_list.random


#!/bin/zsh

# rt.sh/tt.sh: Process messages specified in provided file through Bogofilter.
# As tt.sh, output Bogofilter's classification. As rt.sh, output
# classification and train Bogofilter on error.

BOGOFILTER="/usr/share/bogofilter/retrain/bogofilter"
BOGOFILTERCF="/usr/share/bogofilter/retrain/bogofilter.cf"

while read -r REPLY
do
    # Print the path relative to the corpus, then the true class
    # (the directory name directly under the corpus root).
    printf "%41s: " "`echo "${REPLY}" | sed -e 's%/var/spam/corpii/%%'`"
    REAL=`echo "${REPLY}" | sed -e 's%/var/spam/corpii/\([^/]\+\)/.*%\1%'`
    printf "%7s: " "${REAL}"

    # Classify the message and pull the verdict and score out of the output.
    BOGO=`< "${REPLY}" "${BOGOFILTER}" -C -c "${BOGOFILTERCF}" -v`
    GUESS=`echo "${BOGO}" | sed -e 's/.*\(Spam\|Legit\|Unsure\).*/\1/'`
    SPAMICITY=`echo "${BOGO}" | sed -e 's/.*\(spamicity=[^ ]\+\),.*/\1/'`
    printf "%6s: " "${GUESS}"
    printf "%20s\n" "${SPAMICITY}"

    # Train on error only when invoked as rt.sh (the .sh suffix is stripped
    # so that the comparison also works when the script is called as ./rt.sh).
    if [[ "`basename "${0}" .sh`" == "rt" ]]; then
        case "${GUESS}" in
            "Spam")
                if [[ "${REAL}" == "NOTSPAM" ]]; then
                    < "${REPLY}" "${BOGOFILTER}" -C -c "${BOGOFILTERCF}" -n
                fi
                ;;
            "Legit")
                if [[ "${REAL}" == "SPAM" ]]; then
                    < "${REPLY}" "${BOGOFILTER}" -C -c "${BOGOFILTERCF}" -s
                fi
                ;;
            "Unsure")
                if [[ "${REAL}" == "NOTSPAM" ]]; then
                    < "${REPLY}" "${BOGOFILTER}" -C -c "${BOGOFILTERCF}" -n
                else
                    < "${REPLY}" "${BOGOFILTER}" -C -c "${BOGOFILTERCF}" -s
                fi
                ;;
        esac
    fi
done < "${1}"


# rt.sed: Colorize output of rt.sh.
# "^[" below stands for a literal escape character (0x1b).

s/\( SPAM\):/^[[1;31m\1^[[0m:/
s/\(NOTSPAM\):/^[[32m\1^[[0m:/
s/\(Spam\)/^[[1;31m\1^[[0m/g
s/\(Legit\)/^[[32m\1^[[0m/g
s/\(Unsure\)/^[[7;34m\1^[[0m/g


#!/bin/sh
# getstats.sh: Output summary of results from retrain.sh-generated logs.

# The " *" in each pattern absorbs the padding that rt.sh/tt.sh add when
# they right-align the verdict column.
echo "false positives: `grep 'NOTSPAM: *Spam:' "${1}" | wc -l`"
echo "false negatives: `grep ' SPAM: *Legit:' "${1}" | wc -l`"
echo "unsure spams   : `grep ' SPAM: *Unsure:' "${1}" | wc -l`"
echo "unsure notspams: `grep 'NOTSPAM: *Unsure:' "${1}" | wc -l`"


Appendix B 


The scripts below are used for filtering our messages during the SMTP exchange. 


#!/bin/sh

# qq-qfilter: log the mail, then add and fix* the Bogofilter header, then
# forward the mail to the caching server, then check if it's spam and refuse it
# if it is.

# A full description of qmail-qfilter's operation is beyond the scope of this
# paper, but in summary, it takes a mail message on stdin and pipes it through
# a "--" delimited list of filters. Each filter's stdout becomes stdin for the
# next filter. Exit codes are checked and manipulated to provide the calling
# qmail daemon what it expects.

# *"Fixing" the Bogofilter header means making sure it is the first line in the
# header. The version of Bogofilter we use does not place this header
# consistently; this has reportedly been fixed in current versions.


exec /usr/bin/qmail-qfilter /usr/local/bin/qfilter-logger -- \
    /usr/bin/bogofilter -l -e -p -- \
    /bin/sed -e '1,/^X-Bogosity:/{;' -e '/^X-Bogosity:/!{; H; d; };' \
        -e '/^X-Bogosity:/{; p; x; D; }; }' -- \
    /usr/local/bin/qfilter-cache -- /usr/local/bin/qfilter-spamcheck


#!/bin/sh

# qfilter-spamcheck: read the first line of stdin as an X-Bogosity header; if
# the message is spam, exit 31 to refuse delivery, otherwise pass the message
# through unaltered.

# This script requires rewind, which is part of DJB's serialmail package and
# can be found here:
# http://cr.yp.to/serialmail.html

if head -n 1 | grep -q '^X-Bogosity: Spam,'; then
    exit 31
fi

# Not spam: seek stdin back to the start and emit the whole message.
/usr/local/bin/rewind

cat -


ABSTRACT 


The system administrators of large organizations often receive a large number of e-mails 
from their users, and a substantial amount of effort is devoted to reading and responding to these 
e-mails. The content of these messages can range from trivial technical questions to complex 
problem reports. Often these queries can be classified into specific categories, for example reports 
of a file-system that is full or requests to change the toner in a particular printer. In this project we 
have experimented with text-mining techniques and developed a tool for automatically classifying 
user e-mail queries in real-time and pseudo-automatically responding to these requests. Our 
experimental evaluations suggest that one cannot completely rely on a totally automatic tool for 
sorting and responding to incoming e-mail. However, it can be a resource-saving complement to an 
existing toolset that can increase the support efficiency and quality. 


User Dynamics: Problem Statement 


Reading and responding to e-mails from disgrun- 
tled users in an organization takes up several hours of 
a system administrator’s daily effort. At the Engineer- 
ing faculty at Oslo University College, with some 
1,500 users, system administrators often encounter an 
inbox with hundreds of messages in the morning when 
arriving at work. The task of responding to these e- 
mail messages can be daunting, time-consuming and 
tedious. Yet timely, high-quality replies are immensely 
important if individual users are to fulfill their roles 
in the organization. Technical difficulties and setbacks 
that may seem trivial to a system administrator can be 
overwhelmingly frustrating and destructive for the user 
and unnecessarily costly for the organization. Further, 
it is important that 
anomaly reports from users get the attention of the 
system administrators early such that corrective and 
preventive actions can be taken and to minimize the 
damage and the repercussions of the anomalies. It is 
therefore important that the system administrators’ 
inboxes are continuously monitored. 


Until very recently there have been few general 
formal training programs for system administrators 
with the exception of product specific certifications. 
Yet, the existing training and certification programmes 
primarily focus on technical issues. User support is 
usually learned on the job and there are often few offi- 
cial procedures for how to handle user queries. 
Attempts at ISO-certifying and standardizing procedures 
in organizations may result in some of these 
queries being formalized, but often it is superficial 
bureaucracy that, in terms of timeliness and respon- 
siveness, may actually reduce quality rather than 
improving it. Consequently, system administrators 
often develop their individual practices for handling 
user queries, and often these “common sense” practices 
work quite well in practice. 


Organizations are dependent on a team of system 
administrators and these administrators can usually be 
reached through one point-of-call, such as 
support@fantasticCorp.com. E-mails sent to such an 
address are commonly forwarded to the entire team of 
system administrators that are responsible for user 
queries. Two obvious reasons behind this strategy are 
that it is easier for the users to remember and the 
administrator on-duty will receive the message. How- 
ever, often individuals in an organization know one or 
more of the system administrators personally and may 
decide to send a message to this specific system 
administrator directly. This is (for the reasons stated) 
an undesirable practice. 


The size of the inbox is dependent on a number 
of factors. Two important factors are when the inbox 
was last inspected and the occurrence of certain sys- 
tem events. According to queuing theory, e-mails can 
be modeled as arriving in a stream with a Poisson dis- 
tribution, and therefore the size of the inbox is approx- 
imately proportional to the time passed since it was 
last read. This phenomenon is something that everyone 
coming to work in the morning or returning from 
lunch, a meeting or even a business trip can testify to. 
Further, the occurrence of certain events, such as a system 
failure or anomaly usually ignites a burst of e-mails. For 
example a printer breakdown on Friday afternoon does 
not go down well with users struggling to meet their 
deadlines. Once a system failure or anomaly affects the 
work carried out by the user, the users often choose to 
resolve the problem by contacting the system administrators 
by e-mail, as it may be hard to reach the system 
administrator by phone. The more users an anomaly 
affects, the more e-mails the system administrator 
receives. Occurrences of anomalous events are hard to 
predict but they themselves may be modeled using a 
Poisson-like distribution. 
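
The proportionality claimed above follows directly from the Poisson model. As a brief worked illustration, let N(t) be the number of e-mails arriving during an interval of length t and let λ be the average arrival rate; the numerical rate used in the last line is a made-up figure chosen only to show the arithmetic, not a measurement from our site.

```latex
% Illustrative only: the rate \lambda in the last line is hypothetical.
\begin{align*}
P\bigl(N(t) = k\bigr) &= \frac{(\lambda t)^k e^{-\lambda t}}{k!},
    \qquad k = 0, 1, 2, \dots \\
\mathrm{E}[N(t)] &= \lambda t \\
\lambda = 20~\text{messages/hour},\ t = 16~\text{hours}
    &\;\Rightarrow\; \mathrm{E}[N(t)] = 320
\end{align*}
```

The expected inbox size grows linearly with the time since the inbox was last read, which is consistent with finding hundreds of messages after a night away. Bursts caused by system failures are not captured by a constant rate and would have to be modeled separately.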


The messages from users can be coarsely classi- 
fied into four categories: a) automatically generated 
mails from various processes such as cfengine, at, 
cron, etc., b) unsolicited e-mail and spam, c) general 
questions and d) urgent questions, notifications and 
anomaly reports. Automatically generated mail is eas- 
ily sorted into assigned folders using simple keyword- 
based filters, and automatic spam filters are getting 
increasingly good at reducing the amount of junk in 
the inbox. General questions from users, such as how 
to do “this or that,” are often not urgent and the task 
of responding to the question can be delayed to a 
“low-peak” time. Often, users ask similar questions 
and the system administrator can retrieve an old 
answer written to a different user at a previous point in 
time. In this way, the administrator hopefully saves 
time if this archived message can be recalled with lit- 
tle effort. However, urgent e-mails from users report- 
ing anomalies, faults and strange behaviour in the 
computer system should be read and dealt with imme- 
diately. Many system administrators in small and 
medium-sized organizations get a “feeling” for how 
to read and respond to e-mail. The purpose of this 
work is to assist the system administrator by simplify- 
ing e-mail management. DIGIMIMIR, based on text- 
mining techniques, automatically clusters and catego- 
rizes incoming e-mail into related topics and presents 
the e-mail categories in terms of identifiable and char- 
acteristic keywords. System administrators get a clear 
overview of the inbox contents, and can thus more 
easily identify urgent matters that need to be resolved 
at high priority. In addition, DIGIMIMIR can be con- 
nected to a reservoir of pre-answered questions such 
that the most suitable answers to commonly asked 
questions are found automatically. 


Previous Work 


Much has been written on helpdesk support [8, 
11, 13, 29] and many commercial systems exist, such 
as Ell and IssueTracker. These systems primarily 
focus on tracking historic aspects of customer support, 
maintenance of searchable knowledge bases and the 
identification of recurring issues. Many of these prod- 
ucts target general businesses. GNATS is a system 
specifically designed for tracking bugs in software and 
the maintenance of these in databases [28]. 


Trouble ticketing systems, such as OTRS (Open 
Ticket Request System), are useful tools for managing 
large inboxes, which may be handled by several sys- 
tem administrators concurrently. New requests that 
arrive in the inbox are given a ticket (e.g., a number) 
and an automatic reply informing the user that the 
request is received and will be handled. Other requests 
on the same issue are automatically grouped together 
with the existing correspondence related to the ticket. 
The different system administrators therefore have 
easy access to the entire history of the requests associated 
with a ticket. The responses of different system 
administrators can therefore be coordinated. 


Expert system based helpdesk systems have also 
been explored [35], where the staff running the 
helpdesk is guided through a sequence of decision rules 
to solve the particular difficulties. A problem with 
expert systems is the nontrivial establishment of the 
decision rules. Another strategy is case based reasoning, 
which is especially suited to detecting recurring 
problems [4]. In the spoken domain recurrent neural 
networks have been used to route helpdesk calls auto- 
matically based on utterances [9]. 


The difficulties of handling large numbers of 
requests are commonplace in large organizations. 
Research effort has gone into the automatic retrieval 
of answers from existing question-answer lists based 
on queries, such as the VIENA classroom system, 
which uses lexical similarity [32, 33], the FAQFinder 
system [1, 20] which uses semantic knowledge and the 
FAQIQ system which uses case based reasoning [22]. 
In a different approach [30] a template strategy is used 
to answer questions based on information in relational 
databases. Common to all these strategies is that they 
are based on already existing knowledge bases. 


In addition to the distribution of answers it is 
also necessary to categorize questions. The idea of 
automatically sorting e-mail into predetermined cate- 
gories has been examined by several researchers. In 
one theoretical study [36] web-mining agents are 
assessed as a means of automatically sorting e-mail 
using an uncertainty probability based sampling clas- 
sifier and rough relation learning. 


Recently, there has been a huge public interest in 
the problems associated with spam, and a substantial 
effort has gone into developing spam-filtering technology 
[1, 34]. Notably, by far the most efficient strategy 
is the statistically based naive Bayesian classification 
[34]. A naive Bayesian system is trained using a large 
corpus of spam e-mails and non-spam e-mails and a 
word signature vector is established for both groups. 
When a new message arrives, its word signature is 
compared to both that of the spam and the non-spam 
signatures and the one that yields the best match deter- 
mines its category. The spam-filtering problem is 
related to the problem we are addressing in this paper. 
However, it is different on two major accounts. First, 
spam-filtering entails classifying messages into two 
groups, either spam or non-spam. Second, these categories 
are defined a priori. However, in document 
classification there are many (usually more than two) 
categories and the categories might not be known and 
therefore have to be established at run-time. Bayesian 
classification has in fact also been studied in the con- 
text of general document classification [24]. One 
exciting practical development that has occurred in 
parallel with the development of DIGIMIMIR is POPFILE, 
which is a Bayesian based e-mail classification 
program [12]. POPFILE works directly on POP3 e-mail 
accounts and uses Bayesian classification to sort 
incoming messages into predetermined categories. 
POPFILE as a software product has reached some 
maturity, but still suffers from major shortcomings; 
the most notable (at the time of writing) is the lack of 
IMAP support. The IMAP protocol is particularly 
applicable in the context of automatic mail sorting 
applications since mail folders can be managed on the 
server [5], and it is therefore possible to achieve e- 
mail client independence. Further, POPFILE is 
designed only to work in supervised mode and is 
reliant on training data and therefore cannot be 
deployed in unsupervised situations where new cate- 
gories are created dynamically on the fly. 


Another exciting and controversial new develop- 
ment is the announcement of Google’s new e-mail ser- 
vice Gmail [10]. This is a novel strategy in terms of 
managing e-mail. The idea is that the users get a large 
area to store their mails and that e-mails rarely have to 
be deleted. Document classification techniques are 
therefore used to navigate and search through this huge 
reservoir of e-mails. This service is not yet available to 
the general public, but this new thinking with regards to 
dealing with e-mail management may prove to be useful 
in terms of support and helpdesk e-mail management. 


In fact, the largest success of text mining and doc- 
ument classification and retrieval technologies has been 
within the areas of web search engines [21, 25] and 
search engine technologies have developed at a rapid 
pace over the last few years. However, there are a number
of profound differences between the clustering and
classification requirements for indexing web pages and
handling streams of incoming e-mail. First, the largest
difference is probably the volume of text. The web is
huge and its size can be exploited to improve the quality of
the clustering and classification. Personal e-mail collections
or an organization’s e-mail collection remain small
by comparison. Second, indexing of web pages can be
done offline and there are no critical time restrictions on
how fast the pages are indexed. There is a significant
turnaround time from when a change is made to a document
on the Internet until it is reflected in the search results of
the search engines. We could perhaps describe this as an
index epoch. However, in order for an e-mail sorting
system to have value, the classification and clustering
need to be continuous, instant and real-time. Third, in
the indexing of web pages quality is of prime importance,
while in the clustering and classification of e-mail
messages quality can be traded for speed. Until now,
most of the research into text mining and document
classification and retrieval has primarily focused on clustering
and classification quality and has paid less attention to
the real-time operational constraints and the incrementally
growing document collections (inboxes).


Finally, some system administrators wisely
believe that the best support policy is self-reliance and 
that users should be able to resolve their own prob- 
lems as much as possible [7] and hence reduce the 
need for support. 


DIGIMIMIR and Text-mining 


DIGIMIMIR employs techniques borrowed from 
web mining (see [3] for a general introduction to web 
mining) and terminology mining based on text corpora 
(see for example [15]). DIGIMIMIR takes a set of 
messages as input and produces a set of message clus- 
ters as output, i.e., related messages are clustered 
together, and unrelated messages are placed in different
clusters. The process comprises several phases. First,
the system must retrieve the messages; then each message
is pre-processed and transformed into a text vector.
The set of text vectors representing the set of messages
is used as input to the clustering algorithm so
as to compute the most suitable clusters, and finally the
results are presented to the recipient (user).


A dedicated e-mail account is set up and DIGIM- 
IMIR polls the inbox at regular intervals to check for 
new incoming messages. The inbound messages are 
processed as follows: Each message is treated as a 
separate entity and used as a basis for computing a 
word vector. A word vector represents a set of words 
as a vector, where each word in a dictionary is 
assigned a specific position in the vector, thus the size 
of the vector equals the size of the dictionary used. 
The presence of a word is marked by a positive non- 
zero integer, where the value represents the number of 
times the word occurs in the text. A zero denotes the 
absence of a specific word. Clearly, different messages
containing different words have different word vectors
in the word space. Word order, however, is not recorded.
Hence the phrases “nuts and bolts”
and “bolts and nuts” would both yield the same word
vector.
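
The word vector construction described above can be illustrated with a minimal Python sketch. It is not taken from the DIGIMIMIR source; the function name and the four-word dictionary are assumptions made purely for illustration.

```python
from collections import Counter

def word_vector(text, dictionary):
    """Return the occurrence count of each dictionary word, in dictionary order."""
    counts = Counter(text.lower().split())
    # A word that does not occur in the text gets a zero at its position.
    return [counts[word] for word in dictionary]

# Hypothetical four-word dictionary; a real dictionary would be far larger.
dictionary = ["and", "bolts", "nuts", "screws"]

v1 = word_vector("nuts and bolts", dictionary)
v2 = word_vector("bolts and nuts", dictionary)
```

Both phrases map to the same vector because only the counts at each dictionary position are stored, not the order in which the words appeared.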


To generate a word vector the text is organized 
into individual words. The first step is to filter high 
frequency words, also known as stop words [31, 14]. 
This is achieved using a stop word dictionary com- 
puted from word frequency lists [17]. Next, word 
stemming is employed to obtain the general form of 
words [27], with the purpose of reducing the size of the
word vectors. Then, a dictionary-based spell checking
technique is used: each entry is looked up in a reference
dictionary comprising all the possible valid words in the
language in all grammatical forms. If an entry in the text
dictionary cannot be found in the reference wordlist,
then the entry is tagged as a potentially incorrect
spelling. All entries marked as potentially incorrectly
spelled words are crosschecked against the reference
wordlist using Metaphone [26]. Metaphone is a technique,
inspired by SOUNDEX, for matching words
based on their English phonetic sound, and it is particularly
suitable for spell checking applications. Entries
with no Metaphone match in the dictionary are
considered special terms. Entries with a Metaphone
match are most likely incorrectly spelled words (see
[19] for an excellent survey of automatic spelling cor- 
rection techniques). Then, each instance of the remain- 
ing words is counted and represented in the word vec- 
tor. The word vector therefore represents the poten- 
tially interesting words that are characteristic of the 
question [6]. Further, the word vectors can be large 
and contain large amounts of noise. Research into 
text-mining often strives to reduce the dimensionality, 
or size, of the word vectors by the means of some 
transformation [34]. In DIGIMIMIR a simple word 
vector reduction technique is employed, namely word 
masking. For each new vector that is presented
to the system, only the words present in the given word
vector are considered when computing the distance to
the other word vectors in the clusters. That is, all words
that are present in the documents to be compared are
discarded if they are not also present in the new document
they are compared to. This mechanism prevents unimportant,
and most probably unrelated, words from influencing the
distance measure. Without dimensionality reduction
or word masking, the auxiliary words may unnecessarily
add to the distance between two word vectors
that in practice are quite similar.
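
A minimal sketch of word masking follows, assuming a Euclidean distance and plain Python lists as word vectors. The function names and the example vectors are hypothetical and not taken from DIGIMIMIR.

```python
import math

def full_distance(a, b):
    """Ordinary Euclidean distance over every dictionary position."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def masked_distance(new_vec, other_vec):
    """Euclidean distance restricted to the words present in new_vec."""
    return math.sqrt(sum((x - y) ** 2
                         for x, y in zip(new_vec, other_vec)
                         if x != 0))  # mask: skip words absent from the new message

new_message = [1, 1, 0, 0]      # a short new question
stored_message = [1, 1, 3, 2]   # a longer, related message with auxiliary words

d_masked = masked_distance(new_message, stored_message)
d_full = full_distance(new_message, stored_message)
```

In this example the masked distance is zero, because the two messages agree on every word of the new message, while the unmasked distance is inflated by the auxiliary words of the stored message.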


The word vectors are clustered using the K-means
algorithm, a classic, widely known and effective
clustering algorithm (see [6]). In clustering, the messages
are represented as vectors in the word space, and the
purpose of the clustering algorithm is to assign word 
vectors that are similar to the same cluster in the vector 
space and assign word vectors that are different to dif- 
ferent clusters. Two vectors are similar if the distance 
between the two vectors is small, and two vectors are 
dissimilar if their distance is large. One popular distance
measure is the Euclidean distance. The K-means
algorithm works as follows: a set of vectors is to be 
clustered into K clusters. Initially, the vectors are 
assigned arbitrarily to the K clusters. Then the mean 
vector for each cluster is computed. Next, the vectors 
are reassigned to the cluster to which they are closest 
and the cluster means are recomputed. The process is 
repeated until some convergence criteria are met. 
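
The steps above can be sketched in a few lines of Python. This is an illustrative sketch rather than the DIGIMIMIR implementation: the round-robin initial assignment, the "no vector moved" convergence test and the iteration cap are assumptions chosen to keep the example deterministic.

```python
import math

def distance(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def k_means(vectors, k, max_iter=100):
    # Arbitrary initial assignment: round-robin over the K clusters.
    assignment = [i % k for i in range(len(vectors))]
    for _ in range(max_iter):
        # Compute the mean vector of every non-empty cluster.
        means = {}
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:
                means[c] = mean(members)
        # Reassign each vector to the cluster whose mean is closest.
        new_assignment = [min(means, key=lambda c: distance(v, means[c]))
                          for v in vectors]
        if new_assignment == assignment:  # converged: no vector moved
            break
        assignment = new_assignment
    return assignment

vectors = [[0, 0], [0, 1], [10, 10], [10, 11]]
clusters = k_means(vectors, 2)
```

For these four vectors the poor initial assignment is corrected after one reassignment, and the two nearby pairs end up in separate clusters.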


Finally, the results of the clustering algorithm are 
presented to the user as a pre-catalogued set of mes- 
sages. Each time a new message or a group of mes- 
sages arrives into the system, the process is repeated 
incorporating the new messages into the clusters. 


Quality Measurements 


It is relatively easy to assess the quality of classi- 
fication tasks when they are applied to a training set, 
as this is a form of supervised learning, where the cat- 
egories or clusters are predetermined. One can simply 
compute the success rate as the number of messages 
that are correctly assigned a cluster, and the error rate 
as the number of messages that are incorrectly 
assigned. However, assessing the quality of clustering 
is more difficult as there is no given mapping between
the training category and the assigned cluster. We
therefore deployed the following strategy: Each document
$d_i$ belongs to a training category $c_i$, and after the
algorithm is deployed it is assigned a cluster $k_i$. After
training, a category-to-cluster matrix is established
where the columns represent the training categories and
the rows represent the assigned clusters. An element at
column $i$ and row $j$ denotes the number of documents
from category $c_i$ that have been assigned cluster $k_j$.


The quality measures are then computed in three 
steps. First, for each category the entire column is 
scanned for the largest value and this is marked as a 
category-to-cluster mapping. Second, for each cluster 
the entire row is scanned for the largest value which 
again is marked as a category-to-cluster mapping. If 
other elements in the row are also marked as a category-to-cluster
mapping (in the first step) then these
are re-marked as “undecided.” Third, the three quantities
are computed as follows: the success rate is computed
by summing all elements marked as category-to-cluster
mappings and dividing by the total number of documents;
the failure rate is the sum of all non-zero
unmarked elements divided by the total number of
documents; and finally the ratio of ambiguous messages
is the sum of all elements marked as “undecided”
divided by the total number of documents.
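
The three steps can be sketched in Python as follows. This is one reading of the procedure, not the authors' code: the function name, the example matrix and the tie-breaking rule (the first largest element wins) are assumptions.

```python
def quality_measures(matrix):
    """matrix[j][i] = number of documents of category i (column)
    that were assigned to cluster j (row)."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    total = sum(sum(row) for row in matrix)

    # Step 1: in each category column, mark the largest element as a mapping.
    step1 = set()
    for i in range(n_cols):
        j_best = max(range(n_rows), key=lambda j: matrix[j][i])
        step1.add((j_best, i))

    # Step 2: in each cluster row, mark the largest element as a mapping and
    # re-mark the other step-1 marks in that row as "undecided".
    mapped, undecided = set(), set()
    for j in range(n_rows):
        i_best = max(range(n_cols), key=lambda i: matrix[j][i])
        mapped.add((j, i_best))
        for i in range(n_cols):
            if i != i_best and (j, i) in step1:
                undecided.add((j, i))

    # Step 3: convert the marked sums into rates.
    success = sum(matrix[j][i] for j, i in mapped)
    ambiguous = sum(matrix[j][i] for j, i in undecided)
    failure = total - success - ambiguous  # everything left unmarked
    return success / total, failure / total, ambiguous / total

# Hypothetical matrix: 3 clusters (rows) by 3 training categories (columns).
matrix = [[8, 1, 0],
          [1, 6, 5],
          [1, 3, 0]]
success_rate, failure_rate, ambiguous_rate = quality_measures(matrix)
```

In the example, the third category has its largest count in the second cluster, but that cluster is dominated by the second category, so the five documents involved are counted as undecided rather than as successes or failures.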


Test Suites 


In order to assess and document the effectiveness 
of the system through a repeatable experiment, one is 
dependent on a test suite with pre-categorized mes- 
sages. The Reuters-21578 Text Categorization Collec- 
tion [23] is a well-known and widely cited test suite, 
comprising Reuters news articles from 1987 that
have been classified and indexed by experts and later 
made available for research purposes. These news arti- 
cles comprise medium to long, well written, pieces of 
English text. A news report is long compared to e-mail 
messages that often are short and poorly written with 
spelling mistakes and various abbreviations. The 
Reuters collection is therefore not completely repre- 
sentative of the problem domain. Further, we were 
unable to use this test suite as the current implementa- 
tion of DIGIMIMIR is optimized for Norwegian. To 
manually categorize messages is time-consuming, dif- 
ficult and error-prone. We therefore deployed three 
strategies for obtaining test-suites: 


First, a small hand-crafted test suite, comprising
100 messages, was used for early testing. This set
contains manually categorized fictitious messages, 
characterized as being easy to cluster and classify. 


Second, students at Oslo University College
were asked to create the second set of messages,
comprising 160 messages, through an online “quiz.” Four
themes with 10 entries each were created. Each entry
comprised a picture and a statement, such as a picture
of well-known politicians or some hi-tech device. The
students were asked to submit a question related to the
entry via a web form. The questions received
could therefore be tagged with the given category.


Third, a set of messages from our local UNIX
system administrator was collected and manually
organized into natural categories and labeled. Unfortunately,
this test set was small and contained a diverse
set of messages, for two reasons. First, the system
administrator limited the selection (out of concern for
privacy and security) and considered the task low priority.
Second, as this paper was finalized during late spring
and early summer, there was not much traffic; the peak
usually occurs in the autumn when new students enroll in
the college courses and acquire accounts on our computer
system.


Implementation Details 


The implementation consists of the following
highly configurable Java components:


Controller-modules control the flow of information through
the system, as the modules can be interconnected in an
arbitrary manner, i.e., the output of one module can be sent to
the input of another module, and so on. Messages are
processed as they travel through the various modules on
their path to their destination. A typical message-path 
comprises an input-module, a set of pre-parsers, a 
parser-module, a set of filters, a sorter-module and an 
output-module. Multiple instances of a module can be 
created and used in different parts of the message chain. 


The modules are glued together by means of
RMI (Remote Method Invocation) and a global con- 
figuration file, allowing various modules to reside on 
different machines in a distributed manner. This fea- 
ture allows systems with a high degree of interactivity 
to be configured, i.e., the system can be configured to 
provide immediate feedback to instant inbound mes- 
sages. A special message-path configuration is 
required, which does not contain conventional input 
and output modules. This implementation also allows 
modules to be interconnected to form a tree-structure 
where all non-leaf nodes are routers that forward mes- 
sages to yet other modules. For instance, a controller 
can be configured to identify the language of a mes- 
sage and forward the message to the corresponding 
controller for the target language of that message. 
Messages can be rejected and re-routed to different 
branches of the graph if problems are detected by 
modules higher up in the tree. 


Input-modules retrieve and convert external data 
into the internal representation used by the system. 
The current set of input modules can read POP3 (Post
Office Protocol) and IMAP (Internet Message Access
Protocol) mailboxes, and external JDBC-compliant
database tables provided the correct fields are appropriately
configured.


Preparsers modify the incoming messages before 
they are tokenised into sentences and words. Removal 
of HTML-tags is a common pre-parser filter operation. 


Parsers tokenise texts into sentences and words. A message
must be parsed before further processing can take
place. 


Filters alter or remove words or phrases from 
messages. For example, a spellchecker replaces incor- 
rectly spelled words with their corresponding correctly 
spelled words, while a stop-word filter removes words 
that are not relevant to the content of the message.


Sorter-modules attach answers to incoming messages,
or signal to the controller that there are no
matching answers to the incoming question.


Output-modules return messages back to their 
source provided they belong to the same category. For 
example, POP3 and IMAP input messages are 
returned via the SMTP (Simple Mail Transfer Protocol)
output module. Further, the database input messages
can be returned to another external JDBC-compliant
database using references created by the input
module.
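
The message path through pre-parsers, a parser and filters can be illustrated with a small sketch. It is written in Python for brevity, although DIGIMIMIR itself is implemented in Java, and every name in it (`strip_html`, `parse`, `stop_word_filter`, `run_path`, the stop word list) is hypothetical. Input, sorter and output modules are left out.

```python
import re

def strip_html(text):
    """Pre-parser: replace HTML tags with spaces."""
    return re.sub(r"<[^>]+>", " ", text)

def parse(text):
    """Parser: tokenise into lowercase words (ASCII letters only in this sketch)."""
    return re.findall(r"[a-z]+", text.lower())

def stop_word_filter(words, stop_words=frozenset({"a", "is", "my", "the"})):
    """Filter: drop words that carry little content."""
    return [w for w in words if w not in stop_words]

def run_path(message, preparsers, parser, filters):
    """Send one message through the configured chain of modules."""
    for preparser in preparsers:
        message = preparser(message)
    words = parser(message)
    for word_filter in filters:
        words = word_filter(words)
    return words

words = run_path("<p>My printer is <b>broken</b></p>",
                 [strip_html], parse, [stop_word_filter])
```

Because each stage only consumes the output of the previous one, stages can be added, removed or reordered by changing the lists passed to `run_path`, which mirrors the configurable message paths described above.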


GUI (Graphical User Interface)-controllers are
equipped with Swing-based GUI components which
allow for direct user-controller interaction. The current
GUI controller can be connected to multiple controllers
simultaneously. An HTML-based web interface
is not currently provided, but is planned for a future 
release. The application is completely written in Java, 
currently using MySQL 3.23 for persistent storage via 
Hibernate. The application is developed and tested 
using Sun’s JRE 1.4.2 (Java Runtime Environment). 


Results 


Figures 1, 2, 3, and 4 show the results obtained 
running DIGIMIMIR on two datasets. The figures illustrate
the cumulative performance of the system, i.e.,
how the system state changes as messages are added to
the system. The horizontal axes indicate the number of
messages in the system and the vertical axes represent
the percentage of correct answers, incorrect answers, ambiguous
or uncertain answers and new clusters. Figure 1
shows the results obtained using the quiz dataset in 
unsupervised mode (160 messages), i.e., without train- 
ing, and Figure 2 shows the results using the quiz 
dataset in supervised mode (80 messages), i.e., with 
training. One half of the dataset was used for training 
and the other half was used during testing. Figure 3 
shows the handcrafted dataset in unsupervised mode 
and Figure 4 shows the hand-crafted dataset in super- 
vised mode. The messages were shuffled into pseudo 
random order for all the test runs. 


All the figures show similar trends, namely that 
the success rate, the error rate and the percentage of 
new clusters converge as more messages are added to
the system, as expected. The
quiz data reveals a clear difference between the unsu- 
pervised and the supervised runs. In unsupervised 
mode we achieve a successful classification rate of 
nearly 50% and an error rate of just below 40%. The 
percentage of ambiguous or uncertain messages
converges quickly to approximately 15%. Further,
the probability of generating a new cluster
converges early at just over 10%. The results achieved
in supervised mode are nearly twice as good. First, the 
success rate converges at around 80%, which is nearly 
a doubling in quality compared to unsupervised mode. 
This result is consistent with similar experiments 
reported in the literature on document classification. 
Second, the failure rate converges at 20%, which
again is a halving in the number of incorrectly
classified messages compared to unsupervised mode.
Further, the rate of ambiguous or uncertain messages
converges early at around 10%. It is also interesting to
observe that the probability of generating a new cluster
converges much more slowly in supervised mode
than in unsupervised mode and that it converges
at a higher value of approximately 20% compared
to 10% for unsupervised mode.


The graphs in the figures smoothly
either decrease (error rate and clustering probability)
or increase (success rate), but at certain points there
appear to be discontinuities in the graphs resulting in
temporary setbacks. These discontinuities mark the
arrival of very dissimilar messages that result in new
clusters being established. When new clusters are
added the classification landscape is altered, which leads
to a temporary classification instability and slightly
lower success rates. These messages can be genuine
and naturally belong in new clusters, or they may
simply be irrelevant or noisy messages.


Figure 1: Cumulative success rates, error rates, uncertainty rates and clustering probabilities using the quiz test
suite in unsupervised mode. [Legend: erroneously sorted messages; correctly sorted messages; undetermined sorted messages; resulted in creation of a new category. Horizontal axis: average after n messages.]


Figure 2: Cumulative success rates, error rates, uncertainty rates and clustering probabilities using the quiz test
suite in supervised mode. [Vertical axis: fraction (x100). Horizontal axis: average after n messages.]


Similar observations can be made in Figures 3
and 4. Figure 3 shows the cumulative success, error
and clustering rates for the custom-made test data in
unsupervised mode with very high success rates, and
Figure 4 shows the same dataset in supervised mode.


Figure 3: Cumulative success rates, error rates, uncertainty rates and clustering probabilities using the hand crafted
test suite in unsupervised mode.


Figure 4: Cumulative success rates, error rates, uncertainty rates and clustering probabilities using the hand crafted
test suite in supervised mode.


Operational Issues


It would seem natural to integrate DIGIMIMIR
with a trouble ticket system such as OTRS. In fact, a
trouble ticketing system could be a good front end to
DIGIMIMIR. At the entry point all incoming messages
are passed through a spam-filter, to remove
noise in the input stream, and then the e-mails are
directly handled by DIGIMIMIR. DIGIMIMIR would
immediately classify the message and identify whether there
are any suitable categories, i.e., a match with a high
probability of belonging to the particular cluster. If
there is a good match then DIGIMIMIR would respond
with an auto-reply if one exists in the system. If one of
the tickets associated with a message in the cluster is
assigned a response, then that response is also returned
to the client. If there is no suitable response, or the
probability is below a given threshold, then the message
is assigned a cluster and a standard ticket notice is
returned to the user. These unprocessed messages are
later addressed by a second line of system administrators.
Each cluster is therefore analogous to a work
queue, and once one general answer is generated for the
messages in the cluster, the general reply is forwarded to
all the recipients. There should also be room to give
specific and independent responses to a particular ticket
in a queue and mark it as such. This would prevent a
specific answer from being used as a general answer for
automatic distribution. Further, it should also be easy to
move a ticket manually to a different cluster if it is
detected that DIGIMIMIR has misclassified the message.
Finally, it should be possible to manually establish
new clusters and manually combine existing clusters
that have been automatically established.
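
The decision between an auto-reply and a standard ticket notice can be sketched as below. The function name, the return values and the threshold of 0.8 are assumptions for illustration; the text above only states that a threshold exists.

```python
def triage(match_probability, cluster_response, threshold=0.8):
    """Decide how to answer a message that has just been classified.

    match_probability: probability that the message belongs to its best cluster.
    cluster_response:  response already stored for that cluster, or None.
    threshold:         hypothetical cut-off for trusting the classification.
    """
    if match_probability >= threshold and cluster_response is not None:
        # Good match and a stored answer exists: reply automatically.
        return ("auto-reply", cluster_response)
    # Otherwise queue the message in its cluster and acknowledge receipt.
    return ("ticket-notice", None)

decision = triage(0.93, "Restart the print spooler.")
```

A message that falls back to a ticket notice stays in its cluster's work queue until an administrator writes a general answer for that cluster.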


Future Work 


There are many improvements that can be made
to DIGIMIMIR. First, the language-specific modules
need to be internationalized. An absolute minimum
addition is support for the English language. Ultimately,
the tools should be easily extended to support
any western language, such that they can be configured
using only a standard wordlist for the language, for
example the wordlists found on most Unix/Linux-type
systems, and a stop word list. These are all relatively
easily accessible. The main challenge is the word
stemming algorithms that must be tailor-made for each
language. Stemming algorithms have been published
for most western languages, but one needs to know the
language at least in order to evaluate the effectiveness
of these algorithms. Another challenge is pictographic-based
scripts such as Chinese. Chinese computer
lingo is a good mix of Chinese characters and
English terms, at least judging from Chinese computer
magazines. There is a vast literature on Chinese
language processing, and some answers may be found
there. POPFILE claims to handle Chinese messages.


Further, we only had the opportunity to experi- 
ment with a few clustering techniques. Document 
classification is an ongoing research topic and better 
algorithms are continuously published. The
DIGIMIMIR tools will most certainly benefit from 
improved clustering and classification strategies. 


A crucial next step of development is extended
IMAP support such that the folder facilities on IMAP
servers can be exploited. In addition to polling messages
off the IMAP server, DIGIMIMIR should use
the messages in the existing folders on the server during
the classification and consequently move messages
from the inbox to the correct folders on the
server. When new clusters are established, DIGIMIMIR
should create a new folder on the server too. This
will allow standard IMAP clients, such as Thunderbird,
to be used as front ends to DIGIMIMIR and
users would require no additional training. The users
will see the new messages in the respective folders
and the newly established folders. Further, by manually
moving messages from one folder to another, the
user can correct the classification engine and manually
affect the classification and clustering. This would
abolish the need for the DIGIMIMIR GUI component.
However, we have had experiences with IMAP clients
that do not register folder changes on the IMAP server
that are made by other IMAP clients.


Another long-term improvement would be to introduce the idea of
collaborative DIGIMIMIR systems, perhaps via a central
body or directory, analogous to how virus filters
update themselves on a regular basis. Many of the
problems faced by system administrators are independent
of the organization, and instead depend on a particular
version of a product, for example a security
patch for the Apache web server. As system administrators
encounter problems and manually establish
categories for these problems, the information can
then be shared with other DIGIMIMIR clients via one
or more central word-vector reservoirs. Then, when
other system administrators encounter similar problems
at a later date, they can benefit from the work
already carried out by the first system administrator
and automatically obtain the new category, perhaps
even with a suggested response. However, there are
obvious privacy and quality issues that need to be
addressed in order to deploy such a strategy.


Another interesting possibility would be to integrate
the system with a network monitoring and
alerting system such as IPSentry. Such systems often
have many channels of notification, including e-mail.
A message from such a system is very easily
detected and correctly classified by DIGIMIMIR,
as such a system produces messages with uniform
wording and format. More interestingly, notifications 
from network monitoring and alerting systems could 
be correlated with user messages. This could greatly 
help a system administrator more easily assess the 
impact of a given network anomaly. 


Finally, DIGIMIMIR could benefit from a thor- 
ough review with regards to efficiency. Currently, the 
system has been tested with hundreds of messages 
with no apparent performance bottlenecks. However, 
a large organization could easily receive a thousand
messages each day, and maybe keep millions of messages
on record. There are no apparent time or space
complexity issues that prevent the system from scaling.
The major bottleneck of the system is the k-means
clustering algorithm, which has a linear time complexity
with respect to the number of messages
(O(c k d n), where k is the number of clusters, d is the number of
dimensions of the documents, n is the number of documents
and c is the number of iterations required)
[16]. Further, the distributed nature of the DIGIM- 
IMIR framework allows it to be configured to run in a 
distributed manner across a network of workstations, 
such that the inherent parallelism can be exploited. 
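
As a rough worked example of the O(c k d n) cost, assume purely hypothetical values of c = 10 iterations, k = 20 clusters, d = 5000 dictionary words and n = 1000 messages. None of these figures are measurements from DIGIMIMIR.

```latex
\[
  c \cdot k \cdot d \cdot n
  = 10 \cdot 20 \cdot 5000 \cdot 1000
  = 10^{9} \ \text{elementary distance operations}
\]
\[
  n \to 2n = 2000
  \quad\Longrightarrow\quad
  c \cdot k \cdot d \cdot 2n = 2 \cdot 10^{9}
\]
```

With the other parameters held fixed, doubling the number of messages doubles the work, which is what linear time complexity in n means. In practice d and c may also grow as the message collection grows.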


Implications of Automated E-mail Support 


The deployment of automatic document classifi- 
cation technology in the context of sorting incoming 
e-mails and automatically providing answers must be 
done with care. Nearly all requests are unique and the 
quality of the responses is best maintained by handling 
the requests manually. However, the quality of a sup- 
port service is a tradeoff between the quality of the 
response content and its timeliness. A support department
that does not respond to requests, or that
answers them weeks after they were originally
sent, can be frustrating to the users as they do not
feel that their request is taken seriously. On the other
hand, a rapid meaningless or obviously auto-generated 
incorrect reply can be equally frustrating and infuriat- 
ing to the user and embarrassing to the organization. 
This may result in aggressive users. It is very difficult 
to make a foolproof system, i.e., a system that never 
incorrectly classifies a message and responds with the
answer to a different question. To minimize the dam- 
age one can adopt psychological techniques, such as 
using humble wording in the responses. For instance, 
in DIGIMIMIR we used the following careful wording 
in the auto-generated replies: “We found that your
question X is very similar to question Y. One answer 
to question Y is Z. Note that this response is automati- 
cally generated and a human will evaluate this 
response hopefully quite soon.” 


The optimal policy is probably a mixture of man- 
ual responses and automated responses. Manual 
responses should be used during off-peak periods for
messages with a lower classification probability, while
the automatic response mechanisms are to be used
during peak hours when there are not sufficient manual
resources to handle the stream of incoming messages.
The administrator can then later inspect the requests 
that arrived during the peak time and assess the 
responses, and then send corrections to users when 
appropriate. Such a post-peak period inspection is still 
more efficient than having to respond to every mes- 
sage manually. It will in some situations be easier to 
spot a message that is out-of-place when it is placed 
together with other messages that are related. Ultimately,
the inspection cycle is necessary as 10 to
20% of the messages are incorrectly classified and need to
be handled manually. At an error rate of 10%, an organization
that receives 500 messages a day will have 50 such messages,
and an organization that receives 5000 messages a day
will have as many as 500.


Politically speaking, an automated system is 
desirable as it is resource-saving. As long as the finan- 
cial gains of deploying such a system are larger than 
the negative impacts of the errors introduced, an orga- 
nization is likely to embrace such technology. A con- 
sequence of this is that in the next instance the system 
administrators may be budgeted with even fewer 
resources by the decision makers. 


Availability 


DIGIMIMIR is constantly under development 
and is released under a GPL license. Its binaries, 
source code and documentation can be downloaded 
from http://www.digimimir.org/. 


Conclusions 


In this paper we have addressed the problem of 
e-mail helpdesk support. Text-mining techniques were 
explored as means of partially automating the support 
tasks and a tool dedicated to this task was presented. 
Our experiments confirm that it is possible to achieve 
50% or more correctly classified messages in unsuper- 
vised mode and 80% correctly classified messages in 
supervised mode. This can greatly help support staff 
reduce their workload, especially when combined with 
an auto-response feature. In operation, the system can 
exploit a mixture of supervised and unsupervised 
mode. Messages that are manually approved or classified
can be used in a supervised manner,
while new clusters can be established dynamically and 
unsupervised in order to get clear overview reports of 
totally new situations. We believe that document clas- 
sification technology to a greater extent will be incor- 
porated into the broad range of e-mail handling sys- 
tems in the years to come due to the great potential for 
reducing the workload, but to ensure quality there 
should be a human in the loop. Further, such technol- 
ogy could also help reduce the emergency response 
time of a support team as emergency messages can be 
more quickly identified and separated from less urgent
requests. 
ABSTRACT


Spyware is a rapidly spreading problem for PC users causing significant impact on system 
stability and privacy concerns. It attaches to extensibility points in the system to ensure the spyware 
will be instantiated when the system starts. Users may willingly install free versions of software 
containing spyware as an alternative to paying for it. Traditional anti-virus techniques are less 
effective in this scenario because they lack the context to decide if the spyware should be removed. 


In this paper, we introduce Auto-Start Extensibility Points (ASEPs) as the key concept for 
modeling the spyware problem. By monitoring and grouping “hooking” operations made to the 
ASEPs, our Gatekeeper solution complements the traditional signature-based approach and 
provides a comprehensive framework for spyware management. We present ASEP hooking 
statistics for 120 real-world spyware programs. We also describe several techniques for 
discovering new ASEPs to further enhance the effectiveness of our solution. 


Introduction 


Spyware is a generic term referring to a class of 
software programs that track and report computer users’ 
behavior for marketing or illegal purposes. In addition, 
spyware may actively push advertisements to the user 
by popping up windows, and change the Web browser 
start page, search page, and bookmark settings. Spyware 
often silently communicates with servers over the Inter- 
net to report collected user information, and may also 
receive commands to install additional software on the 
user’s machine. Users infected with spyware commonly 
experience severely degraded reliability and performance,
such as increased boot time, sluggish feel, and
frequent application crashes. Reliability data shows that 
spyware programs account for fifty percent of the 
overall crash reports [FTC04]. Saroiu, et al. [SGL04] 
point out security problems caused by vulnerabilities 
in spyware programs. A recent study based on scanning
more than one million machines shows the alarming
prevalence of spyware: an average of four to five
spyware programs (excluding Web browser cookies) 
were running on each computer [E04A, E04B]. 


Current anti-spyware solutions [AA, SB] are pri- 
marily based on the signature approach used by anti- 
virus software: each spyware installation is investi- 
gated to determine its file and Registry signatures for 
use by scanner software to detect spyware instances. 
This approach has several problems. 


First, many spyware programs may be consid- 
ered “legitimate” in the following sense: their compa- 
nies sponsor popular freeware to leverage their instal- 
lation base; since users agree to an End User Licens- 
ing Agreement (EULA) when they install freeware, 
removing the bundled spyware may violate this agreement.
In many cases, the freeware ensures the spyware
is running on the user’s system by refusing to run if its 
bundled spyware is removed. 


Second, the effectiveness relies on completeness 
of the signature database for known spyware. Beyond 
the difficulty of manually locating and cataloging new 
spyware, this approach is further complicated because 
spyware are full-fledged applications that are gener- 
ally much more powerful than the average virus [C04], 
and can actively take measures to avoid detection and 
removal. Companies creating spyware generate rev- 
enue based on the prevalence of their applications and 
therefore have a financial incentive to create technolo- 
gies that make it hard to detect and remove their soft- 
ware. They have the need and the resources to invest 
in developing sophisticated morphing behavior. 


Third, some spyware installations may contain 
common library files that non-spyware applications 
use. If care is not taken to remove these files from the 
spyware signatures, scanners using these signatures 
may break non-spyware applications. 


Finally, popular spyware removal programs are 
commonly invoked on-demand or periodically, long 
after the spyware installation. This allows the spyware 
to collect private information and makes it difficult to 
determine when the spyware was installed and where 
it came from. A monitoring service that catches spy- 
ware at installation time is essential for reducing expo- 
sure and avoiding re-infection. 


To complement the signature-based approach, we 
introduce the concept of Auto-Start Extensibility Points 
(ASEPs) as the key to spyware management. Our work 
is based on the observation that, in order to monitor
users’ behavior on an ongoing basis and to maximize 
the time window for monitoring, an overwhelming 
majority of spyware programs infect systems in such a 
way that they are automatically started upon reboot and 
the launch of most commonly used applications. We use 
the term ASEPs to refer to the subset of OS and applica- 
tion extensibility points that can be “hooked” to enable 
auto-starting of programs without explicit user invoca- 
tion. An ASEP may accept one or more ASEP hooks, 
each of which is associated with an auto-start program. 


We distinguish two types of ASEP hooking: (1) as 
a standalone application that is automatically run by 
registering as an OS auto-start extension such as a Win- 
dows NT service or a Unix daemon; or (2) as an exten- 
sion to an existing application that is either automatically 
run (such as WinLogon.exe with its Notify extensions) or 
popular and commonly run by users (such as the Internet 
Explorer browser with its Toolbar extensions). 


Figure 1 depicts a Windows-based system with
three layers of gates. The Outer Gates are the entrance
points for program files from the Internet to get on
user machines. The Middle Gates are the ASEPs that 
allow programs to hook a system to essentially 
become “part of the system” from a user’s point of 
view. The Inner Gates control the instantiation of pro- 
gram files. Our solution, named Gatekeeper, identifies 
and monitors the Middle Gates and exposes all ASEP 
hooks to allow effective management of spyware. 


Problem Formulation and Decomposition 


Figure 2 illustrates the “life cycle” of the spyware
management process and provides a problem
decomposition that enables us to systematically reason 
about this complex problem. Note that our current 
solution does not address the issues of malicious soft- 
ware such as RootKits [P03]; we will briefly discuss 
malicious behavior in the Discussions section. 


In Step (1), given a spyware-infected machine, 
since we do not have sufficient context information for 
already-installed spyware programs, we rely on signature-based
scanning and removal tools (such as Ad-Aware
and SpyBot-S&D) to remove existing spyware.
After Step (1), the Gatekeeper infrastructure is put in
place to provide a spyware management framework.

Figure 1: Outer, Middle, and Inner Gates: (1) Outer Gates are the entrance points for program files from the Internet
to get on user machines. User Consent includes explicit consent to install, for example, a freeware program,
and implicit consent to allow spyware programs bundled with the freeware to get installed as well. Incorrect
Security Settings include the “Low” setting for Internet Zone security, incorrect entries in the Trusted Sites list,
and incorrect entries in the Trusted Publishers list, which would allow drive-by downloads; (2) Middle Gates
are the ASEPs that allow programs to survive reboots and maximize their chance of running all the time. BHO
stands for Browser Helper Object. LSP stands for Layered Service Provider; and (3) Inner Gates control the
instantiation of program files into active running program instances. They include CreateProcess, LoadLibrary,
and other program execution mechanisms, and can be used to block any potentially harmful programs if they
are not properly signed or on the known-good list.


In Step (2), we continuously monitor all ASEPs 
by recording, alerting, and blocking potentially unde- 
sirable ASEP hooking operations. It is essential that 
the signature database includes user-friendly descrip- 
tions of known-good [G03, NSRL, PP] and known- 
bad ASEP hooks to enable presentation of actionable 
information to the user. 


If the user decides to install a freeware applica- 
tion after assessing the risks of bundled spyware pro- 
grams as specified in the EULA, bundle tracing in 
Step (3) captures all components installed by the freeware
and displays them in Gatekeeper as a group with a
user-friendly name enabling the user to manage and 
remove them as a unit. 


In Step (4), we monitor the performance and relia- 
bility of the system since the bundle installation and as- 
sociate any problems with the responsible compo- 
nent(s). These “credit reports” provide the user with a 
“price tag” for the freeware functionality, enabling the 
user to make value/cost judgments about the freeware. 


Finally, our solution’s effectiveness is directly 
related to completeness of the ASEP list. In Step (5), we 
discover new ASEPs of OS and popular frequently-run 
software by either analyzing indirection patterns in file 
and Registry traces or troubleshooting infected machines. 
In this paper, we will cover (2), (3), and (5) in the next 
three sections. 
ASEP Management


ASEP Categorization 


On Windows platforms, most of the ASEPs 
reside in the Registry. Only a few of them reside in the 
file system. We have found it useful to classify ASEPs 
into the following categories: 

1) ASEPs that start new processes: for example, 
the HKLM\SOFTWARE\Microsoft\Windows\ 
CurrentVersion\Run Registry key and the 
%USERPROFILE%\Start Menu\Programs\
Startup file folder are well-known ASEPs for 
auto-starting additional processes. 

2) ASEPs that hook system processes: for exam- 
ple, HKLM\SOFTWARE\Microsoft\Windows NT\ 
CurrentVersion\Winlogon\Notify allows a DLL 
to be loaded into WinLogon.exe. 

3) ASEPs that load drivers: for example, HKLM\ 
System\CurrentControlSet\Control\Class\ 
{4D36E96B-E325-11CE-BFC1-08002BE10318}\ 
UpperFilters allows loading of a keylogger 
driver; HKLM\System\CurrentControlSet\ 
Services allows loading of general drivers. 

4) ASEPs that hook multiple processes: for 
example, Winsock allows a Layered Service 
Provider (LSP) DLL or a Name Space Provider 
(NSP) DLL to be loaded into every process that 
uses Winsock sockets; HKLM\SOFTWARE\
Microsoft\Windows NT\CurrentVersion\Windows\
AppInit_DLLs allows a DLL to be loaded into
every process that links with User32.dll. 


5) Application-specific ASEPs: for example,
HKLM\SOFTWARE\Microsoft\Internet Explorer\
Toolbar allows a toolbar to be loaded into the
IE browser; HKCR\PROTOCOLS\Name-Space
Handler and HKCR\PROTOCOLS\Filter allow
other kinds of DLLs to be loaded by IE;
HKLM\SOFTWARE\Microsoft\Internet Explorer\
Search\SearchAssistant and CustomizeSearch
take URLs as input and control which search
pages will be loaded.


Figure 2: The Spyware Management “Life Cycle” and Problem Decomposition: see descriptions in the Problem
Formulation and Decomposition section.
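The five categories above lend themselves to a simple lookup table. The following is a minimal sketch, not Gatekeeper's actual data structure: the category labels, the ASEP_CATEGORIES name, and the categorize function are our own, and only a subset of the Registry and file-system locations named in the list is included.

```python
# Example ASEPs taken from the categorized list, keyed by category.
# Category labels are illustrative names, not Gatekeeper terminology.
ASEP_CATEGORIES = {
    "starts_new_process": [
        r"HKLM\SOFTWARE\Microsoft\Windows\CurrentVersion\Run",
        r"%USERPROFILE%\Start Menu\Programs\Startup",
    ],
    "hooks_system_process": [
        r"HKLM\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Winlogon\Notify",
    ],
    "loads_driver": [
        r"HKLM\System\CurrentControlSet\Control\Class"
        r"\{4D36E96B-E325-11CE-BFC1-08002BE10318}\UpperFilters",
        r"HKLM\System\CurrentControlSet\Services",
    ],
    "hooks_multiple_processes": [
        r"HKLM\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Windows\AppInit_DLLs",
    ],
    "application_specific": [
        r"HKLM\SOFTWARE\Microsoft\Internet Explorer\Toolbar",
        r"HKCR\PROTOCOLS\Name-Space Handler",
        r"HKCR\PROTOCOLS\Filter",
        r"HKLM\SOFTWARE\Microsoft\Internet Explorer\Search\SearchAssistant",
    ],
}


def categorize(path):
    """Return the category of the ASEP that `path` equals or lies under,
    or None if the path is not below any listed ASEP."""
    lowered = path.lower()
    for category, aseps in ASEP_CATEGORIES.items():
        for asep in aseps:
            prefix = asep.lower()
            # Match on whole key components so that "...\Run" does not
            # accidentally match a sibling such as "...\RunOnce".
            if lowered == prefix or lowered.startswith(prefix + "\\"):
                return category
    return None
```

Matching is case-insensitive because Registry key names are; a path that is not under any listed ASEP simply yields None, which is where a real monitor would fall back to its "unknown" handling.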


ASEP Hooking Statistics 


Figure 3 shows the number of spyware hooks to 
each of the 34 ASEPs hooked by at least one of the 
120 spyware programs in our Spyware Zoo. Browser 
Helper Objects (BHOs), HKLM “Run” key, and IE 
“Toolbar” are the three most popular ASEPs. Figure 4 
shows that most of the individual spyware programs 
hook only three or fewer ASEPs, but some hook as
many as 13 or 17. When spyware and freeware pro- 
grams are bundled together in a single installation, it is 
not uncommon to see that a single bundle hooks 10 or 
more ASEPs, which would usually cause significant 
performance degradation. (Note that a freeware pro- 
gram may not have any ASEP hook if it is to be manu- 
ally launched by the user as needed, but spyware pro- 
grams always have ASEP hooks.) 
ASEP Monitoring and Alerting


ASEP monitoring watches all known ASEPs for 
any of the following three types of changes: (1) adding 
a new ASEP hook; (2) modifying an existing ASEP 
hook; and (3) modifying the executable file pointed to 
by an existing ASEP hook. 


Each of the above changes generates a new event 
log entry that contains the ASEP pathname, the ASEP 
hook name, the executable file pathname or URL, and 
the timestamp of the hooking operation. Optionally, a 
notification can be displayed to the user or forwarded to 
an enterprise management system for processing. Notifi- 
cations for ASEP programs signed by trusted publishers 
can be optionally suppressed to reduce false positives. 
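The three change types and the event-log fields described above can be illustrated with a snapshot comparison. This is a sketch under stated assumptions: Gatekeeper itself builds on always-on tracing rather than polling, and the snapshot layout (a dictionary from ASEP pathname and hook name to target path and file digest) and the function name diff_snapshots are hypothetical.

```python
from datetime import datetime, timezone


def diff_snapshots(old, new, now=None):
    """Compare two ASEP snapshots and return one event per change.

    Each snapshot maps (asep_path, hook_name) to (target, file_digest),
    where `target` is the executable pathname or URL the hook points to.
    """
    now = now or datetime.now(timezone.utc)
    events = []
    for (asep, hook), (target, digest) in new.items():
        if (asep, hook) not in old:
            change = "hook added"            # change type (1)
        else:
            old_target, old_digest = old[(asep, hook)]
            if target != old_target:
                change = "hook modified"     # change type (2)
            elif digest != old_digest:
                change = "executable modified"  # change type (3)
            else:
                continue                     # nothing changed
        events.append({
            "asep": asep,
            "hook": hook,
            "target": target,
            "change": change,
            "timestamp": now.isoformat(),
        })
    return events
```

Each returned dictionary carries the same fields the text lists for an event-log entry; suppression of alerts for trusted publishers would be a filter applied to this list before notifying the user.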


Figure 5 shows a screenshot of a user notifica- 
tion alert. During the installation of a freeware screen- 
saver, the user is notified of five new ASEP hooks. 
The “Screen Saver” hook alert is obviously expected. 
Searching the Signatures and Descriptions Database 
with the information from the other four alerts (by 
clicking on the alerts) reveals that they belong to 
“eXact Search Bar” and “Bargain Buddy.” Based on 
the information provided for these two pieces of soft- 
ware and the benefit provided by the screensaver, the 
user can then make an informed decision about whether
to keep this bundle. 








Figure 3: Distribution of spyware ASEP hooks: 120 spyware programs with 334 hooks to 34 ASEPs; ASEPs are
sorted by popularity.
Bundle Management


The term ‘bundle’ represents a set of applications 
and extensions added to a user’s system as part of a 
single installation process. If a component of a bundle 
installs additional applications or components at a 
later time, these are also added to the installer’s bun- 
dle. A bundle is intended to match an end user’s ideal 
management unit for installing, disabling, and remov- 
ing software on their system. 


Bundle Tracing 


Although multiple ASEP alerts appearing during 
a single installation typically indicate that the ASEPs 
belong to the same bundle, this time-based grouping is 
not robust against concurrent installations. For exam- 
ple, Figure 6 illustrates two concurrent installations of 
the DivX bundle (with two ASEP hooks) and the 
Desktop Destroyer (DD) bundle (with five ASEP 
hooks). Time-based grouping would incorrectly group 
all seven ASEP hooks in a single bundle. 


Gatekeeper uses a bundle tracing technique built 
on top of the always-on Strider Registry and file trac- 
ing [WVD+03, DRD+04]. ASEP hooks created by 
processes belonging to the same process tree are 
assigned to the same bundle. If any Add/Remove Pro- 
grams (ARP) entries are created by any process in the 
tree, the concatenation of their ARP Display Names is 
used to label the bundle. Referring to Figure 6, the 
upper process tree defines the DivX bundle with two 
ARP names, and the lower tree defines the DD bundle 
with three ARP names. 
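The process-tree grouping described above can be sketched as follows. This is an illustration only: it assumes the trace has already been reduced to child-to-parent process links within each installation (so that an installer's top-level process has no recorded parent), and the names tree_root and trace_bundles, the " | " separator, and the event tuples are our own choices.

```python
from collections import defaultdict


def tree_root(pid, parent_of):
    """Follow parent links until reaching a process with no recorded parent.
    Assumes `parent_of` describes a forest (no cycles)."""
    while pid in parent_of:
        pid = parent_of[pid]
    return pid


def trace_bundles(parent_of, hook_events, arp_events):
    """Group ASEP hooks by the process tree that created them.

    hook_events: iterable of (pid, hook_name)
    arp_events:  iterable of (pid, arp_display_name)
    Returns a list of (bundle_label, hooks); the label is the
    concatenation of ARP display names, or "" if none was created.
    """
    hooks = defaultdict(list)
    names = defaultdict(list)
    for pid, hook in hook_events:
        hooks[tree_root(pid, parent_of)].append(hook)
    for pid, display_name in arp_events:
        names[tree_root(pid, parent_of)].append(display_name)
    return [(" | ".join(names.get(root, [])), found)
            for root, found in hooks.items()]
```

Because grouping follows process ancestry rather than time, two installations that interleave their hooking operations still end up in separate bundles. A bundle whose label is empty corresponds to an installation that created no ARP entry.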


Any spyware that does not provide an ARP entry 
for removal will show up as a bundle with no name. 
For example, the ClientMan software creates one 
ASEP hook silently at installation time with no 
accompanying ARP entry. Since installations without
ARP entries are uncommon, this installation will be 
flagged as potentially unwanted. 


We have observed that some spyware may ini- 
tially install partially, and delay the full installation 
until a later time to make it more difficult for the users 
to identify which Web site is actually responsible for 
installing the software. For example, after the partial 
installation with one ASEP hook, ClientMan would 
non-deterministically select a later time after several 
reboots to finish its installation with seven additional 
ASEP hooks. 


Gatekeeper bundle tracing captures such devious 
behavior as follows. First, it performs URL tracing to 
link each Web-based bundle installation with its 
source URL. Although IE browser history already 
records the URL and timestamp for every Web site 
visited, it is a global history for all instances of IE and 
is garbage collected after a few weeks. We have 
implemented a Browser Helper Object to record the 
process ID of the IE instance that navigated to each 
URL so that the URL trace can be correlated with the 
ASEP hooking trace. Second, to handle latent installa- 
tions, bundle tracing keeps track of all the files created 
by each bundle. If any of the files are later instantiated 
to create more ASEP hooks, these additional hooks are 
added to the original bundle. 
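The handling of latent installations can be sketched as a small bookkeeping class. This is a hedged illustration rather than Gatekeeper's implementation: the class name BundleTracker, its methods, and the idea of keying files by lower-cased path are assumptions made for the example.

```python
class BundleTracker:
    """Remember which bundle created each file, so that hooks made later
    by a program started from one of those files join the original bundle."""

    def __init__(self):
        self.bundle_of_file = {}   # lower-cased file path -> bundle id
        self.bundles = {}          # bundle id -> {"url": ..., "hooks": [...]}

    def record_install(self, bundle_id, source_url, created_files, hooks):
        """Record the initial (possibly partial) installation of a bundle."""
        self.bundles[bundle_id] = {"url": source_url, "hooks": list(hooks)}
        for path in created_files:
            self.bundle_of_file[path.lower()] = bundle_id

    def record_hook(self, creator_image, hook):
        """Attribute a new ASEP hook to the bundle that owns the creating
        program's image file. Returns the bundle id, or None if unknown."""
        bundle_id = self.bundle_of_file.get(creator_image.lower())
        if bundle_id is not None:
            self.bundles[bundle_id]["hooks"].append(hook)
        return bundle_id
```

With this bookkeeping, a program that installs one hook now and several more after a few reboots is still reported as a single bundle linked to its source URL, as long as the later hooks are created by a file the original installation dropped.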


Extensibility Point Add/Remove Programs (EP-ARP)


Figure 7 shows Gatekeeper displaying bundle
information through a new “Manage Auto-Start Programs”
button in the Control Panel ARP interface
(which we call EP-ARP). It scans all ASEPs and displays all
current hooks by bundles. Users can also choose to
sort the ASEP hooks by the timestamps obtained from
the event log in order to highlight newly installed
ASEP programs. This is particularly useful when a
user invokes EP-ARP immediately after she observes a
problem to identify the potentially problematic program.

Figure 4: Number of ASEP hooks used by each spyware.


The EP-ARP display also provides three options 
for bundle removal/disabling. For example, the bundle 
name clearly shows that “eXact Search Bar” and
“Bargain Buddy” have been installed as part of the 
DD bundle. If the user wants to remove DD, she can 
click the “Disable Bundle” button and reboot the 
machine. This removes all five ASEP hooks, stopping 
the three bundled programs from automatically start- 
ing, despite their files remaining on the machine. 


Alternatively, the user can look for the three 
ARP names in the regular ARP page and invoke their 
respective removal programs there. Since it is not 
uncommon for spyware to provide unreliable ARP 
removal programs, the user can double-check EP-ARP 
to make sure that none of the ASEP hooks gets left 
over after ARP removals. Gatekeeper also integrates 
with System Restore [SRO1], as shown at the bottom
of Figure 7. If both removal options fail, the user can 
click on the “Restore” button to roll back machine 
configuration to a System Restore checkpoint taken 
before the bundles were installed. 


ASEP Discovery 


In addition to well-known ASEPs and docu- 
mented ASEPs, we discover new ASEPs through 
two additional channels. The first channel involves troubleshooting
machines with actual infections that cannot
be cleaned up by Gatekeeper because of spyware
using unknown ASEPs. We provide two tools for this 
purpose: the Strider Troubleshooter [WVD+03] and 
the automatic AskStrider scanner [WRV+04]. The sec- 
ond channel involves analyzing Registry and file 
traces collected from any machine to discover new 
ASEPs that can potentially be hooked by future spy- 
ware. Once new ASEPs are discovered, they are added 
to the Gatekeeper list to increase its coverage. The 
same ASEP discovery procedure can also be used by
system administrators to discover ASEPs in third-party
or in-house applications that do not come with a
list of specified ASEPs.

Figure 5: ASEP Hooking Alerts: One freeware screensaver (the bottom alert) bundling two other programs, each
hooking two ASEPs (the other four alerts).


ASEP Discovery Through AskStrider 


The AskStrider scanner is an enhanced Windows 
Task Manager. In addition to displaying the list of run- 
ning processes, AskStrider displays the list of modules 
loaded by each process and the list of drivers loaded 
by the system. More importantly, AskStrider gathers 
context information from the local machine to help 
users analyze this large amount of information to iden- 
tify the most interesting pieces. Such context informa- 
tion includes the System Restore file change log, 
meta-data for patch installations, and driver-device 
associations [WRV+04]. 


Figure 8 shows two sample screenshots of 
AskStrider. The upper pane displays the list of pro- 
cesses sorted by the approximate last-update timestamps 
of their files, according to System Restore. Files that
were updated within the past week are highlighted. 
The lower pane displays the list of modules loaded by 
the selected process in the upper pane, with the same 
time-sorting and highlighting. Additionally, if a file 
came from a patch, the patch ID is displayed as an 
indication that the file is much less likely to have 
come from a spyware installation. 


Also illustrated in Figure 8 is an example of how 
AskStrider was used to discover a new ASEP. Figure 8 
(a) shows that, after the installation of SpeedBit, a new 
process DAP.exe was started and the browser process
iexplore.exe was loading four newly updated DLL files 
from the same installation. After we disabled all new 
ASEP hooks from Gatekeeper EP-ARP and rebooted 
the machine, iexplore.exe was still loading two new 
DLLs as shown in Figure 8 (b). Searching the Registry 
using the filename DAPIE.dll revealed that SpeedBit
was hooking an additional ASEP under HKCR\
PROTOCOLS\Name-Space Handler, which has since
been added to the ASEP list monitored by Gatekeeper.

Figure 6: DivX and Desktop Destroyer Bundle Tracing: solid arrows represent creations of child processes; dashed
arrows represent creations of ARP entries; dotted arrows represent creations of ASEP hooks. Each process tree
defines the scope of the bundle, named by concatenation of ARP friendly names.


ASEP Discovery Through Strider Troubleshooter 


The strength of AskStrider is that the scanning is 
completely automatic and typically takes less than a 
minute to run. The weakness is that it only captures 
running processes and loaded modules at the time of 
its scan. If a spyware program gets instantiated 
through an unknown ASEP and exits before 
AskStrider is invoked, AskStrider may not be able to 
capture any information revealing the unknown ASEP. 


The Strider Troubleshooter [WVD+03] can capture 
such behavior in an “auto-start trace” that records
every single file and Registry read/write during the
auto-start process. This tool asks the user of an
infected machine to select a System Restore checkpoint 
(of files and Registry) that was taken before the infec- 
tion. By comparing that checkpointed state with the cur- 
rent infected state, the tool calculates a diff set that con- 
tains all changes made by the spyware installation. Then 
it intersects the diff set with the auto-start trace to pro- 
duce a report that contains all ASEP hooks made by the 
spyware installation and accessed during auto-start. 


For example, in the case of Praize Desktop, 
HKCU\Control Panel\Desktop\Wallpaper was a previ- 
ously unknown ASEP that allows running an HTML 
file as a desktop picture. It did not show up in 
AskStrider, but it showed up in the Strider Trou- 
bleshooter report as a newly discovered ASEP. 
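The diff-and-intersect step of the Troubleshooter can be sketched with plain set operations. The state dictionaries, the file names, and the function names below are hypothetical; only the Wallpaper key comes from the Praize Desktop example, and the real tool works on System Restore checkpoints and Strider traces rather than in-memory dictionaries.

```python
def diff_set(checkpoint, current):
    """Return the entries that were added or changed since the checkpoint.
    Both arguments map a file or Registry location to its content."""
    return {key for key, value in current.items()
            if checkpoint.get(key) != value}


def asep_hook_report(checkpoint, current, autostart_trace):
    """Intersect the diff set with the locations touched during auto-start."""
    return sorted(diff_set(checkpoint, current) & set(autostart_trace))


# Hypothetical states before and after an installation.
WALLPAPER = r"HKCU\Control Panel\Desktop\Wallpaper"
checkpoint = {WALLPAPER: "default.bmp", r"HKLM\Example\Run\Tool": "tool.exe"}
current = {
    WALLPAPER: "desktop.html",              # changed and read at auto-start
    r"HKLM\Example\Run\Tool": "tool.exe",   # read at auto-start, unchanged
    r"HKCU\Software\Example\Settings": "1", # changed, never read at auto-start
}
autostart_trace = [WALLPAPER, r"HKLM\Example\Run\Tool"]

print(asep_hook_report(checkpoint, current, autostart_trace))
```

Only the Wallpaper entry survives both filters: the unchanged entry is dropped by the diff, and the settings entry is dropped because nothing read it during auto-start.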

Figure 7: Extensibility Point-Add/Remove Programs (EP-ARP): the “DivX Pro Codec Adware | DivX Player”
bundle includes two ASEP hooks GMT.exe and CMESys.exe that came from Gator. The “Desktop Destroyer
FREE | eXact Search Bar | Bargain Buddy” bundle includes five ASEP hooks. Clicking on the “Restore” button
at the bottom can roll back the system and remove the two bundles.

Figure 8: AskStrider for ASEP discovery.


ASEP Discovery Through Strider Trace Analysis


By definition, ASEP programs must (1) appear 
in the “auto-start trace” that covers the execution window
from the start of the booting process to the point
when the machine “finishes all initializations and is
ready to interact with the user”; and (2) get instantiated
through an extensibility point lookup, instead of
having their instantiation hard-wired into other pro- 
grams that are auto-started. 


New ASEPs can therefore be discovered by ana- 
lyzing the auto-start trace from any machine to iden- 
tify the following indirection pattern: an executable 
filename is returned as part of a file or Registry query 
operation, followed by an instantiation of that exe- 
cutable file. 
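A minimal sketch of searching a trace for this indirection pattern follows. The event format (tuples tagged "query" or "exec"), the suffix list, and the function name are assumptions made for illustration; the actual Strider trace format is not described here.

```python
EXECUTABLE_SUFFIXES = (".exe", ".dll", ".sys")


def find_candidate_aseps(trace):
    """Scan a time-ordered trace for the indirection pattern.

    Trace events are either
      ("query", location, returned_value)  - a file or Registry read
      ("exec", filename)                   - an instantiation of a file
    Returns (location, filename) pairs where a query returned an
    executable filename that was instantiated later in the trace.
    """
    returned_by = {}   # executable name -> location that first returned it
    candidates = []
    for event in trace:
        if event[0] == "query":
            _, location, value = event
            name = value.lower()
            if name.endswith(EXECUTABLE_SUFFIXES):
                returned_by.setdefault(name, location)
        elif event[0] == "exec":
            name = event[1].lower()
            if name in returned_by:
                candidates.append((returned_by[name], name))
    return candidates
```

An instantiation that precedes the query, or one whose filename was never returned by any query, is not reported, which mirrors the requirement that the program be started through a lookup rather than being hard-wired into another auto-start program.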


In an experiment, we collected auto-start traces 
from five Windows XP machines for analysis. By 
looking for the indirection pattern, we were able to 
validate some of the known ASEPs in our list and dis- 
cover 17 new ASEPs (including five ASEPs for a 
third-party, auto-start anti-virus program). There are 
three distinctive classes of patterns: 

1) ASEPs that accommodate multiple hooks: for 
example, HKLM\SOFTWARE\Microsoft\InetStp\ 
Extensions allows for multiple administrative 
extensions for the IIS server; HKLM\SOFTWARE\ 
Microsoft\Cryptography\Defaults\Provider allows 
for multiple providers; HKLM\SOFTWARE\ 
Microsoft\Windows NT\CurrentVersion\Winlogon\
Userinit allows for multiple initialization
programs specified in a comma-separated 
string. 

2) ASEPs with a single hook: for example, the 
ASEP HKCR\Network\SharingHandler appears 
to allow only one handler. 

3) ASEPs that require multiple indirections for 
lookup: for example, every hook to the ASEP 
HKLM\SOFTWARE\Microsoft\Windows\
CurrentVersion\ShellServiceObjectDelayLoad contains
a Class ID that is used in an additional Registry 
lookup to retrieve the executable filename from 
HKCR\CLSID\<Class ID>\InProcServer32. 
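The third class, lookup through multiple indirections, can be illustrated with a dictionary standing in for the Registry. This is a sketch under assumptions: the nested-dictionary model, the use of an empty value name for a key's default value, the function name resolve_hooks, and the sample Class ID and DLL path are all hypothetical.

```python
SSODL = (r"HKLM\SOFTWARE\Microsoft\Windows\CurrentVersion"
         r"\ShellServiceObjectDelayLoad")


def resolve_hooks(registry):
    """Resolve each ShellServiceObjectDelayLoad hook to its executable.

    `registry` maps a key path to a dict of {value_name: data}; the
    default value of a key is stored under the empty name "".
    Returns {hook_name: executable_path or None}.
    """
    resolved = {}
    for hook_name, class_id in registry.get(SSODL, {}).items():
        # Second lookup: the Class ID leads to the in-process server.
        server_key = "HKCR\\CLSID\\" + class_id + "\\InProcServer32"
        resolved[hook_name] = registry.get(server_key, {}).get("")
    return resolved
```

A monitor that watched only the first key would see a Class ID change but not a change to the DLL path stored under the CLSID entry, so both levels need to be covered for ASEPs of this class.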


We have observed a couple of interesting cases 
where our analysis may produce “false-positive” 
ASEPs in the sense that it is arguable whether they 
should be included in our list for monitoring. First, 
some DLL files do not export any functions and are 
only used as resource files to provide data; so they 
may not be considered ASEPs. But they are still 
potential ASEPs if a DllMain routine can be added to
cause code execution. 


Another case is organization-specific ASEPs. For 
example, all the machines in the same organization 
may run an auto-start program deployed by its IT 
department that exposes its own ASEPs. Obviously, 
such ASEPs should not be added to the global list; but 
the system administrators in the organization may 
want to add them to their local list if they are concerned
about these ASEPs being hooked.


Discussions 


Although Gatekeeper is proven to be effective 
against today’s spyware, there are many different ways 
in which spyware can evolve to evade detection. In 
this section, we discuss such limitations and potential 
future work to address them. 


Limitations of ASEPs 


In general, the following problem is intractable: 
given the static persistent-state image stored on 
a hard drive, determine what code will be exe- 
cuted when a machine is booted into the OS 
image stored on that drive. 
Ideally, we need to trace all executions by actually 
booting into that OS and recording all processes, modules,
drivers, and code segments that are loaded or
injected. 


We introduced the concept of ASEPs in the con- 
text of spyware management as an approximation to 
the non-existing solution to the above problem. This 
approximation has at least five limitations: 

1) Definition of “popular and commonly run 
programs”: beyond OS programs, we have 
included only the Web browser for ASEP con- 
sideration. Many commercial or freeware appli- 
cations may have a sufficiently large install 
base and running frequency that make them 
attractive spyware targets. 

2) Cascading ASEP programs: any ASEP pro- 
grams can provide their own custom ASEPs for 
other programs to hook. So, in theory, there can 
be chains of an infinite number of ASEPs that 
allow cascading auto-starts of ASEP programs. 
It is also possible for an ASEP program to 
serve as a custom task scheduler that allows 
spyware programs to be launched after any definition
of the “auto-start phase.”

3) Non-ASEP auto-start programs: The ASEP- 
based approach does not capture programs that 
auto-start through non-extensibility mecha- 
nisms. Although we have not seen many spy- 
ware programs infecting system files directly 
today, that approach may be popular among 
Trojans [104] and may become popular among 
spyware programs once Gatekeeper exposes all 
ASEP hooks. We need to rely on additional sig- 
nature or file-hashing mechanisms to protect sys- 
tem files. If a malicious spyware program uses 
code injection and thread hijacking to evade 
detection, ASEP monitoring should still be useful 
in capturing the first instantiated spyware program.
In theory, it may also be possible for a program
to hide inside an input file and get instantiated
when the file is read by an auto-start program
by exploiting code vulnerabilities.

4) ASEP hijacking: Gatekeeper assumes that the
underlying operating system has not been com- 
promised, so the list of ASEPs on a spyware- 
infected machine is the same as that on a clean 
machine. It is possible that a malicious spyware 
program can “hijack” the ASEPs by replacing 
system files; essentially, the machine can be 
considered to be running a different operating 
system in such cases. For example, a Web post- 
ing [CARO04] describes a way to modify the 
binary file of explorer.exe to create arbitrary 
ASEPs. In our current work, we consider such 
malicious programs targets of anti-virus pro- 
grams, not Gatekeeper. In our future work, we 
plan to rely on digital signatures and file hashes 
to verify that the underlying operating system is 
not compromised. 

5) ASEP hook hiding: Another way for malicious 
software to defeat ASEP-based scanning is to 
intercept all file and Registry query operations 
and remove the software’s own ASEP hooks 
from the query results before they are returned 
to Gatekeeper. Many RootKits are known to 
provide such capability [P03]. There have been 
recent reports that an open-source RootKit is 
being used to hide spyware programs from anti- 
spyware tools [HD]. We plan to augment Gate- 
keeper with an external scanning mechanism, 
which is required to combat such malicious 
programs that essentially take over the entire 
machine once they get started [WVR+04]. 
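
Limitations 3 and 4 both fall back on file hashing to protect system files. A minimal sketch of that idea follows. The manifest layout (path mapped to a SHA-256 hex digest) and the function names are assumptions made for illustration; they are not part of Gatekeeper.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def modified_files(manifest: dict) -> list:
    """List manifest paths that are missing or whose hash differs.

    `manifest` maps a file path (str) to the digest recorded on a
    known-clean machine; this layout is hypothetical.
    """
    changed = []
    for name, expected in manifest.items():
        path = Path(name)
        if not path.is_file() or sha256_of(path) != expected:
            changed.append(name)
    return changed
```

Such a check is only as trustworthy as the baseline and the environment it runs in: if it runs on a machine subject to limitation 5, the file reads themselves may be intercepted, which is one motivation for external scanning.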


Finally, the operating system can be configured 
to auto-start programs based on generated system 
events resulting from the insertion of removable media 
like CDs, hot-pluggable hardware like USB key rings, 
etc. However, as long as they require explicit user 
actions to connect the media to the machine and are 
not automatically started upon reboot, they are not 
considered ASEPs in this paper. 


Bundle Management Challenges 


Our bundle tracing technique assumes that the 
spyware installation programs that we monitor do not 
try to intentionally confuse or maliciously attack Gate- 
keeper. One can imagine that a deceptive spyware 
could hijack ARP and ASEP hook entries of other 
good software so that they are incorrectly included in 
the bad bundle. A malicious spyware running with 
administrator privileges could even disable the bundle 
tracer or modify the recorded bundle information. 


There are two additional challenges that do not 
involve malicious behavior. First, since our bundle 
tracer does not track inter-process communications, any 
bundle installation that involves communications 
between multiple process trees may appear as multiple, 
separate bundles. Second, if two toolbars from two dif- 
ferent bundles are loaded into the browser at the same 
time and one of them expands the bundle by hooking an 
additional ASEP, process tree-based tracking will not 
provide sufficiently fine-grain information to determine
which bundle the new ASEP hook should belong to. 
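
The second challenge can be seen in a toy model of process tree-based attribution. The sketch assumes, purely for illustration, that a new ASEP hook is attributed to the bundle whose installer sits at the root of the hooking process's tree. The process IDs, bundle names, and data layout are hypothetical and do not describe the actual tracer.

```python
def tree_root(pid, parent_of):
    """Walk parent links to the top of the process tree."""
    seen = set()
    while pid in parent_of and pid not in seen:
        seen.add(pid)  # guard against a malformed, cyclic parent table
        pid = parent_of[pid]
    return pid


def bundle_for_hook(hooking_pid, parent_of, installer_bundle):
    """Attribute a new ASEP hook to the bundle whose installer roots the tree.

    Returns None when the tree is not rooted at a known installer, which is
    the situation when the hook is added from inside a shared host process.
    """
    return installer_bundle.get(tree_root(hooking_pid, parent_of))
```

A hook added by a child of an installer resolves cleanly, but a hook added from within the browser process resolves to the browser itself, so two toolbars loaded into that one process cannot be told apart at process granularity.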


Limitations of AskStrider 


AskStrider extracts the last-update timestamps of 
files from the System Restore file change log and is 
therefore subject to its limitations. First, System Restore 
only monitors files with certain filename extensions 
[SRM]; spyware programs with extensions outside the 
monitored set will not be captured by AskStrider’s
highlighting of recent changes because their updates will not
be captured in the file change log. 


Second, System Restore excludes certain tempo- 
rary folders from monitoring. As a result, file updates 
inside those folders will not be captured in the change 
log. Third, a malicious spyware program may delete 
or corrupt the file change log or hide its processes and 
modules from the AskStrider scan. 


Finally, a main feature of AskStrider is that it 
highlights recent changes to aid troubleshooting. We 
have found that such filtering mechanisms are essen- 
tial for reducing the complexity in many systems man- 
agement problems. An obvious limitation of this 
approach is that it will not work well if a user invokes 
AskStrider long after a spyware installation. 
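
The recent-change filter and the extension blind spot can be illustrated with a small sketch. The record layout, the monitored-extension set, and the one-week window are assumptions for the example; the real monitored set is defined by System Restore [SRM].

```python
import ntpath
from datetime import datetime, timedelta

# Hypothetical subset of extensions that a change log might monitor.
MONITORED = {".exe", ".dll", ".sys"}


def is_logged(path):
    """A file update reaches the change log only if its extension is monitored."""
    return ntpath.splitext(path)[1].lower() in MONITORED


def recent_changes(change_log, now, window=timedelta(days=7)):
    """Return paths whose last update falls within `window` before `now`.

    `change_log` is a list of (path, last_update) pairs.
    """
    cutoff = now - window
    return [path for path, updated in change_log if updated >= cutoff]
```

With a fixed window, an installation that happened before the cutoff is simply filtered out, which is the limitation described in the last paragraph above.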


Other Web Browsers and Non-Windows Platforms 


In this section, we study the ASEPs and spy- 
ware-related issues in other Web browsers and operat- 
ing systems. 

Other Web Browsers 


As shown in the Outer Gates in Figure 1, code 
vulnerabilities can be used as an infection vector for 
spyware through “drive-by downloads.” The problem 
is not unique to the IE browser; other browsers also 
have known vulnerabilities, such as Mozilla [KVM] 
and Netscape [HN00]. Exploits of vulnerabilities in 
the Mozilla and Firefox Web browsers have been 
widely publicized in news articles [M04, MO04, BZ]. 
Also, Secunia Advisories [SEC] often describe vulner- 
abilities that affect the Opera and Mozilla Web 
browsers. 


Other browsers also expose extension mecha- 
nisms similar to Windows ActiveX as another infec- 
tion vector for spyware through plug-in installations 
with explicit or implicit user consent. Those affecting 
Mozilla have used a .xpi file which is essentially a .zip 
file containing a JavaScript installer and the 
files/directories to install [CNPM]. An example of this 
is the Flingstone XPI extension at
http://www2?.flingstone.com/cab/sbc_netscape.xpi, which contains
install.js and sbc_netscape.exe. When installed through
Mozilla Firefox, Flingstone adds several ASEP hooks to 
the BHO and HKLM Run key. This infects the Win- 
dows OS in addition to IE although it does not appear 
to infect Firefox itself [DSM04]. The bundle includes 
the well-known software bridge.dll [FB04]. We also 
found a Flingstone-clone ist_netscape.xpi containing 
install.js and istinstall_netscape.exe and exhibiting
essentially the same behavior. 


Just like IE, other Web browsers expose ASEPs 
that can potentially be hooked by spyware. For example, 
Mozilla Firefox has a file system-based ASEP at C:\Pro- 
gram Files\Mozilla Firefox\plugins; all plug-in DLL 
files placed in that directory are automatically loaded by 
Mozilla. It also scans a Registry-based ASEP at 
HKLM\SOFTWARE\MozillaPlugins to locate plug-ins 
that register with the browser through PLIDs [PLUG]. 


The homepage and search page related ASEPs of 
non-IE browsers are generally stored in application 
specific preference files rather than the Registry. For 
example, there are two user preference files in the pro- 
file directory of Netscape/Mozilla: prefs.js that con- 
tains automatically generated default preferences, and 
user.js which contains options that override settings in 
prefs.js. Spyware can hijack the home page and the 
default search page of these browsers by altering the 
value of user_pref("browser.startup.homepage",
"<home page>") and user_pref("browser.search.defaultengine",
"<search page>") in prefs.js [NCPI].
For example, the Lop.com software has been known 
to hijack Netscape/Mozilla home page [LOP]. In gen- 
eral there appears to be less spyware targeting non-IE 
browsers, presumably because their smaller install 
base is less attractive to spyware developers. 
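
A preference-file ASEP of this kind can be audited by comparing the relevant user_pref entries against expected values. The sketch below is illustrative only: the regular expression handles just simple double-quoted string values, and the table of expected values is an assumption supplied by the caller.

```python
import re

# Matches entries such as: user_pref("browser.startup.homepage", "http://example.org/");
USER_PREF = re.compile(r'user_pref\(\s*"([^"]+)"\s*,\s*"([^"]*)"\s*\)')

# The two preferences named in the text as hijack targets.
WATCHED = ("browser.startup.homepage", "browser.search.defaultengine")


def read_prefs(text):
    """Return a dict of string-valued preferences found in prefs.js-style text."""
    return dict(USER_PREF.findall(text))


def hijacked_prefs(text, expected):
    """Return watched preferences whose value differs from the expected one."""
    prefs = read_prefs(text)
    return {
        key: prefs[key]
        for key in WATCHED
        if key in prefs and prefs[key] != expected.get(key)
    }
```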


In some cases, the search and download func- 
tionalities of the browser software itself may raise 
similar privacy concerns. It was reported [NNBO2] 
that, while data on searches conducted from IE’s 
search pane was sent directly to the designated search 
site and was not intercepted by Microsoft, searches 
performed by using Netscape Navigator’s Search but- 
ton were intercepted by Netscape and tagged with 
information that can potentially identify individual 
machines. The term “File Download Spyware” 
[FDS00] refers to those file downloaders that by 
default track user’s entire file download history tagged 
with a unique ID, the machine’s IP address, or even 
the user’s personal email address. 


Non-Windows Platforms 


ASEPs on UNIX operating systems such as 
Linux, AIX, and Solaris can be roughly classified into 
four categories: 

1) The inittab and rc files: The file /etc/inittab
instructs the init process what to do when the
system is up and initializing. It typically asks
init to allow user logons (gettys) and start all
the processes in the directories specified by the
/etc/rc.d/rc file and other rc files such as
/etc/rc.d/rc.local, which is a common place for
the root user to customize the system, including
loading additional daemons.

2) The crontab tool: The cron daemon is started
from either the rc or the rc.local file, and provides
a task scheduling service to run other processes
at a specific time or periodically. Every
minute, cron searches /var/spool/cron for entries
that match users in the /etc/passwd file and also
searches /etc/crontab for system entries (note that
any modification to this file requires root privileges).
It then executes any commands that are
scheduled to run.

3) Configuration profiles for user environment 
(such as .bash for bash shell, .xinitrc or .Xde- 
faults for X environment, and other profiles in 
/etc/) are potential ASEPs. Users are typically 
unaware of what is loaded when they log on, or 
start an X window session. A simple script file 
that contains the command 

script -fq /tmp/.syslog 
can be used to hook an ASEP to record the ter- 
minal activities of the whole system or a spe- 
cific user account, depending on the ASEP 
location. The recording is usually stored in a
hidden file (i.e., a filename that begins with a
“.”) under the world-writable /tmp directory.

4) Loadable Kernel Modules (LKMs) are units 
of object code that can be dynamically loaded 
into the kernel to provide new functionalities. 
By default most LKM object files are placed in 
the directory /lib/modules. However, some cus- 
tomized LKM files can reside anywhere on the 
system [LKMP]. The programs insmod and 
rmmod are responsible for inserting and remov- 
ing LKMs, respectively. 
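
As one concrete example, the commands hooked into the cron ASEP (category 2) can be listed by parsing system-crontab-style text. This sketch assumes the common /etc/crontab layout of five time fields, a user field, and a command, plus "@reboot"-style shortcuts; it skips comments and environment assignments and does not attempt to cover every cron dialect.

```python
def crontab_commands(text):
    """Return (user, command) pairs from system-crontab-style text."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line or comment
        if line.startswith("@"):
            # Shortcut form: @reboot <user> <command>
            parts = line.split(None, 2)
            if len(parts) == 3:
                entries.append((parts[1], parts[2]))
            continue
        fields = line.split(None, 6)
        # Environment lines such as PATH=/usr/bin are not schedule entries.
        if len(fields) < 7 or "=" in fields[0]:
            continue
        entries.append((fields[5], fields[6]))
    return entries
```

Comparing the returned list against a list recorded earlier gives a simple view of what has been added to this ASEP.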


Our preliminary investigation shows that spy- 
ware is not a substantial threat to the current 
Unix/Linux world. Perhaps this is because Unix/Linux 
has a much smaller install base than Windows in the 
consumer desktop market, which makes it less attrac- 
tive to spyware writers. Another reason might be that 
most Unix/Linux users do not run as administrators; 
many, if not most, of the spyware programs require 
administrator privileges to install and run. Finally, 
Unix/Linux users who do run as administrators are 
advanced users who are unlikely to fall into the trap of 
installing spyware. 


Related Work 


Earlier versions of commercial anti-spyware pro- 
grams focused on the signature-based, on-demand 
scanning approach. The latest Ad-Aware Ad-Watch 
real-time monitor [AP] and Spybot-S&D TeaTimer 
[ST] provide real-time monitoring similar to Gate- 
keeper ASEP monitoring. But they do not seem to 
include centralized auditing and bundle tracing, and 
the context information that they provide to the users 
is limited, making them less effective as a manage- 
ment solution as compared to Gatekeeper. On the 
other hand, they put more emphasis on blocking and 
protection. The Autoruns tool [AR04] and the Win- 
dows XP SP2 IE Add-on Manager both cover only a 
subset of ASEPs known to be hooked by spyware. 


An alternative approach to combating spyware
programs is to cut off their communications with remote 
servers so that collected personal information will not be 
sent out. One way to achieve this is to use the Hosts file 
to map all blacklisted host names to the local loopback 
address [BUP]. This approach essentially applies 
known-bad signatures to the host names and similarly 
lacks the context information for proper spyware man- 
agement. Moreover, it addresses only the privacy issue, 
not the reliability and performance issues. 
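
The Hosts-file technique amounts to emitting one loopback mapping per blacklisted name. A minimal sketch, in which the host names are placeholders and the blacklist source is left to the caller:

```python
def hosts_entries(blacklist, loopback="127.0.0.1"):
    """Build Hosts-file lines that map each blacklisted name to loopback.

    Names are normalized to lower case and duplicates are dropped.
    """
    seen = set()
    lines = []
    for name in blacklist:
        name = name.strip().lower()
        if name and name not in seen:
            seen.add(name)
            lines.append(f"{loopback}\t{name}")
    return lines
```

The output carries nothing but names, which makes the lack of context information noted above easy to see.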


Saroiu, et al. [SGLO04] presented a measurement 
study of four widespread spyware programs in a uni- 
versity environment by analyzing a week-long trace of 
network activity. Their results showed that the spy- 
ware problem is of large scope. They also described a 
specific vulnerability in actual spyware programs to 
demonstrate that the potential for spyware to introduce 
substantial security problems is real. 


Summary 


In this paper, we have modeled the spyware man- 
agement problem as an ASEP tracking and bundling 
problem. We have described the Gatekeeper solution 
that provides visibility into important system changes 
and answers the following critical questions for every 
potential spyware program: 

1) “Where did it come from?” Our URL source
tracing identifies the Web site from which the 
program was downloaded; bundle tracing iden- 
tifies the freeware that bundled the spyware. 

2) “When was it installed, where was it installed, 
and what was installed?” Our ASEP monitor- 
ing detects and records the installation events, 
and context lookup determines which file is 
installed where. 

3) “How does it get instantiated?” The ASEP to
which the program is hooking determines how 
it will get auto-started. 

4) “How do I disable/remove it?” Our extended
Add/Remove Programs user interface exposes 
each ASEP hook and allows simple disabling; 
alternatively, the ARP entries that are bundled 
with the ASEP hooks can be used for removal. 


With these capabilities, the Gatekeeper tool in its 
current form is useful for technical users and system 
administrators to gain back control of their machines 
and to effectively manage spyware. But there remains 
one critical piece to the puzzle to make the tool useful 
and make the presentation actionable by average 
users: 

5) “What does it do?” This will require detailed 
experiments and analysis of the program and 
matching the program’s behavior against a list 
of objective criteria. It will then allow the user 
to make an informed decision about whether to 
remove the program, based on the trade-off 
between the benefit and the potential pri- 
vacy/security/reliability/performance concerns 
of all the bundled programs. 


ABSTRACT


The number of security vulnerabilities discovered in computer systems has increased explosively. 
Currently, in order to keep track of security alerts, system administrators rely on vulnerability 
databases such as: CERT Coordination Centre, Securityfocus BugTraq and Sans Vulnerabilities 
Notes Database. Such databases are designed primarily to be read and understood by humans. Given 
the speed at which an exploit becomes available once a vulnerability is known, and the frequency of 
occurrence of such vulnerabilities, manual human intervention is too slow, time-consuming and may 
not be effective. We propose the design of a new vulnerability database which is oriented to be 
machine readable and processable rather than human oriented. This allows automated response to a 
vulnerability alert rather than relying on manual intervention of system administrators. With this 
approach, many kinds of automatic processing of alerts become feasible. We show the value of such 
a database by constructing a prototype sample scanner for Unix systems tailored for Linux RedHat 
and FreeBSD. We envisage that our work can help spur the development of far more effective
vulnerability databases to benefit a wide-ranging user community.


Introduction 


A worrying trend in the age of the Internet is the 
increasing incidence of cyber attacks. CERT statistics 
[1] quote 114,855 reported incidents (an incident may
involve an arbitrary number of sites, even thousands) 
in the first nine months of 2003 alone. This is a large 
jump from 21,756 incidents in 2000. 


One of the objectives of computer security emer- 
gency centers like CERT is to help disseminate vulner- 
ability alerts and relevant advisory notes to the user 
community in a timely fashion. However, the speed of 
cyber attacks, together with the complexity of administering
computer and network infrastructures today,
makes it difficult for many system administrators to
cope with such attacks. While automatic tools may be 
available, there is still a need to routinely inspect any 
security/vulnerability alerts in order to take the neces- 
sary corrective measures. 


Current sources of such alerts are designed pri- 
marily for human consumption and contain large 
amounts of information in natural language format. In 
this paper, we call such sources vulnerability
databases because they deal with collections of data,
regardless of whether they are actually kept in database
form. While a human oriented format is useful
for disseminating the full details of an alert, it also 
requires a human in the chain to make use of it. This 
problem is acknowledged in a CERT document [2]. 
Given the 5500 vulnerabilities reported in 2002, it is 
estimated that a system administrator would need 229 
days just to digest the information. Furthermore,
usually multiple vulnerability databases need to be consulted
to fully deal with a vulnerability, i.e., just the
CERT entry is not sufficient. Thus, the deck is stacked 
on the side of the hackers rather than the system 
administrators. 


Clearly, the solution would be to move away 
from direct human processing towards automatic secu- 
rity alert response processing. This paper proposes an 
initiative to redesign vulnerability databases to be 
machine oriented and amenable to automatic process- 
ing. In practice, such a database would also need to 
integrate vulnerabilities disclosed from multiple 
sources. The dissemination of machine processable 
alerts allows for automated tools to operate on an alert 
immediately without requiring humans in the loop. 
This would cut down the long time interval between 
release of a vulnerability/advisory note and corrective 
action being taken. Other automated tools do exist, 
e.g., Microsoft Windows systems have Software 
Update Services, however there is little which is gen- 
eral purpose, publicly accessible, and open to public 
or third party scrutiny and verification. We have 
developed a proof-of-concept machine oriented data- 
base schema using a vulnerability expression language 
for describing the targets and effects of vulnerabilities. 
To illustrate the use of this database, we have devel- 
oped a prototype vulnerability scanning robot which 
can determine existing and potential vulnerabilities 
based on the database. 


The creation of an effective machine oriented
vulnerability database would require the cooperation 
of many parties such as CERT, BugTraq, vendors, 
software developers, etc. As such, this paper is not 
meant to be a standalone definitive solution. Rather the 
prototype database and scanner are intended to spur the
development of machine oriented databases by the par- 
ties concerned. We believe that our proof of concept 
presents the key elements for further development of 
machine oriented vulnerability databases. The use of a 
simple vulnerability expression abstraction also sim- 
plifies the integration of data from multiple sources. 


Motivation and Design Goals 


Figure 1 reproduced from the following CERT 
report [3] describes the vulnerability exploit cycle. 
The Y-axis represents the number of incidents for a 
given vulnerability. 


The graph illustrates the time lag between the 
release of a vulnerability/advisory report and the
decrease of incidents following corrective measures by 
users. We argue that current vulnerability databases, 
such as CERT, Bugtraq, CVE, in their present format 
are not designed to facilitate a speedy user response 
because they suffer from the following limitations: 

1. Much of the information in these databases, par- 
ticularly the portions which relate to dealing 
with a vulnerability, is only in a human readable 
free-text format. While this may be necessary to 
convey the full information content, it also 
means that a human needs to interpret the data- 
base entry. This makes it difficult to have any 
form of automated machine processing of this 
information. While it is possible to analyse the 
natural language text, this may introduce more 
problems due to ambiguities in natural language. 

2. Different response centers use different terminologies
and conventions in describing one vulnerability,
which may confuse the users of the
information. For instance, some databases list
the affected systems according to the vendor’s version
(e.g., RedHat Linux 8.0). A different vulnerability
might refer to the Linux kernel version
instead.

3. There is a conflicting and fluctuating standard
among response centers which actively promote
their own methods and standards, thus
causing frequent shifting and switching among
different proposed standards.


Figure 1: Vulnerability exploit cycle (CERT Coordination Center).


Our philosophy is that vulnerabilities should be
expressible in an explicit form in terms of data (or a
description) rather than an implicit form like code to
process a vulnerability. Hence the data can be stored in a
database (or any data description language, e.g., XML).
Our database is designed with the following goals:

• The database is designed so that it can be consolidated
from multiple sources, in which each
vulnerability entry includes the origin of information,
the environment in which it can have an
impact, its consequence, as well as additional
useful information.

• The pre-requisites and the consequence of a
vulnerability are described using an abstraction
which we call a vulnerability expression. The
vulnerability expression allows a precise formulation
of the nature of the vulnerability and
is machine processable. The vulnerability
expressions are not specific to a particular system
but rather tailorable to the specific system
using another mapping, e.g., a configuration
file may be mapped separately to its pathname
for the particular system being tested.

• The structure of the database should allow easy
retrieval by both users and automated tools via SQL.


We also want to have an automatic scanner
which can use the database to do the following:

• Check whether a given vulnerability exists on a
local system;

• Scan local system for all possible vulnerabilities;

• Notify the existence of potential vulnerability on
the local system should certain environmental 
factors such as system services which are cur- 
rently off get activated; 

• Analyze relationship between different vulnerabilities,
e.g., whether one vulnerability can be
exploited to lead to another. 
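
The last goal, relating vulnerabilities to one another, becomes a simple reachability question once prerequisites and consequences are expressed as data. The sketch below models both as sets of capability labels. The labels and entry names are hypothetical, and this set model is a stand-in for illustration, not the vulnerability expression language itself.

```python
def reachable(vulns, initial):
    """Find which vulnerabilities can be exploited, possibly in a chain.

    `vulns` maps a name to a (prerequisites, consequences) pair of sets.
    Starting from the `initial` capabilities, repeatedly apply any entry
    whose prerequisites are all held, adding its consequences, until no
    further entry applies. Returns (names in the order found, final set).
    """
    capabilities = set(initial)
    used = []
    progress = True
    while progress:
        progress = False
        for name, (needs, grants) in vulns.items():
            if name not in used and needs <= capabilities:
                used.append(name)
                capabilities |= grants
                progress = True
    return used, capabilities
```

In this model, one vulnerability leads to another exactly when its consequences supply a prerequisite that the starting capabilities lacked.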


Related Work 


There have been a number of popular tools that 
scan for any presence of vulnerability or configuration 
weaknesses in a system. Some notable examples are: 
COPS [4], SATAN [5] and Nessus [6]. These tools are 
code-based scanning applications where the logic of 
vulnerability checking is embedded tightly in the scanner’s
code. This means that including a new check for
a vulnerability entry requires one to update the scan- 
ner’s code, its sub-component(s), or its configuration 
file. In contrast, our system uses a generic scanner 
which makes use of vulnerability descriptions stored 
separately in a vulnerability database. While a code- 
based solution is generally more powerful, it requires 
that code/plug-ins be written. There are trust and veri- 
fication issues which we discuss later in this section. 
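
The contrast between code-based and description-based checking can be made concrete. In the sketch below the checks live entirely in data, and one generic routine interprets them. The entry fields, package names, and version numbers are invented for illustration, and versions are assumed to be purely numeric dotted strings.

```python
def parse_version(text):
    """Turn a dotted numeric version such as '1.3.24' into a comparable tuple."""
    return tuple(int(part) for part in text.split("."))


def scan(installed, entries):
    """Return ids of entries whose package is installed below the fixed version.

    `installed` maps package name to version string; each entry is a dict
    with the hypothetical keys 'id', 'package', and 'fixed_in'.
    """
    hits = []
    for entry in entries:
        version = installed.get(entry["package"])
        if version is not None and parse_version(version) < parse_version(entry["fixed_in"]):
            hits.append(entry["id"])
    return hits
```

Adding a check then means appending a record to `entries`; the `scan` routine itself does not change, which is also what makes each step of the scanner open to inspection.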


There is some existing work which reorganizes 
and integrates information in existing vulnerability 
databases into one that is more of a “real database.”
NIST has developed ICAT [7], a searchable index of 
vulnerability entries leading the users to various vul- 
nerability resources and patch information. Similarly, 
Purdue University maintains a web-based search sys- 
tem called “Public Cooperative Vulnerability Data- 
base” [8]. These databases are, however, designed 
mainly for vulnerability search based on categorized 
attribute values, and not for automated applications. 


Krsul [9] proposes a comprehensive taxonomy of 
vulnerabilities for possible further processing or auto- 
mated manipulation. A database is also proposed. It is 
hard to compare the database since no specific appli- 
cations were co-designed with it. 


Windows Update [10] is a Microsoft online tool 
for automatically updating Windows operating systems 
and Microsoft applications with recent patches. It illus- 
trates some important issues with automatic tools. Win- 
dows Update (and its more automatic cousin, Windows 
Software Update Services [11]) are closed systems. We 
propose an open system which can cater for heterogeneous
environments. Windows Update has a “black-box
update model” which allows easy and seemingly
automatic patch update, yet the non-transparency of the 
system leads to the following issues: 

1. Privacy and Trust issues: As there is no open 
specification or possible inspection on the scan- 
ner, no complete trust can be put on the scan- 
ner. This is the case with any code based sys- 
tem. It is difficult to determine if the scanner is 
performing the correct actions while preserving
local system security policy. Since the update 
system is hosted on the vendor’s server, there is 
no guarantee that local information will not 
leak to an external source thus breaching local 
system privacy. In contrast, our system is open 
to inspection. The database can also be 
deployed locally or organization-wide as dis- 
cussed in the deployment section. 

2. Non-standard vulnerability checking: Win- 
dows Update behaves more as a vendor patch 
update mechanism rather than standardized vul- 
nerability entry checking. Thus, little coverage 
is given back to users in terms of standard vulnerability
report information. This might be too
limiting for a system administrator who, for
example, wants to ensure that his systems are
up-to-date against recent vulnerability reports
regardless of whether patches for the vulnerability
are available or not.

3. No control over the scanner: Users need to 
trust that Windows Update works as it is supposed
to. The importance of this issue is highlighted
by a recent incident of a Swen-style Trojan
horse which posed as a legitimate update
[12]. While this example is a social-engineering
style attack, it illustrates the fact that the Win- 
dows Update mechanism can itself be a vulner- 
ability. In our system, as the database contains a 
machine readable description, all steps of the 
scanner can be verified. 


Some related concerns about the Windows Update
mechanism are discussed in an article by Berlind [13].
We argue that any automatic update or alert processing 
mechanism should be based on an open model which 
can be independently verified. In addition, it should be 
possible for the user to bypass the automatic system in 
cases where the security policy may not allow the exe- 
cution of foreign code or connection to external hosts. 
Moreover, the administrator/user should be able to determine
the consequences of a patch or alert on his system.


Movtraq: A New Vulnerability Database 


The integrated vulnerability database which we 
have called Movtraq (Machine Oriented Vulnerability
and Tracking) database is designed to be compiled 
from multiple source vulnerability databases and is 
usable directly by an automatic scanner (see Figure 2). 


Design Considerations 


The main challenge in designing the new data- 
base is to determine what the actual contents of each 
vulnerability entry should be. For our proof-of-con- 
cept, we have focused on what the database should 
contain rather than on a general database schema. The 
data fields corresponding to a vulnerability fall into 
three general categories: general information and ref- 
erences; vulnerability factors and its environmental 
requirements; and impact of vulnerability. 


General Information Fields


The general information portion mostly contains 
references to several public vulnerability databases 
such as CERT, Bugtraq, etc. The purpose of these 
fields is to give the user a reference to the original 
source of information to obtain additional information. 
This is mainly for human consumption. 


Vulnerability Factors and Environmental Require- 
ments 


The second category, vulnerability factors and 
environment data, provides the main content of 
machine processable vulnerability information. A vul- 
nerability has to exist within a context, hence it is 
described in terms of its original source factor and 
associated environmental factors. By “original source 
factor,” we mean the system component(s) (applica- 
tion or operating system) where the vulnerability orig- 
inates. “Environmental factors” refers to settings/con- 
figuration or services in the local system which make 
the system subject to the vulnerability. 


We distinguish between two kinds of vulnerabili- 
ties: 
• vulnerability which currently exists on the system; and
• vulnerability which potentially exists on the system.


There are a number of different combinations of 
original source and environment factors:


Case 1: Vulnerability factors: match & Environment 
factors: match 


We will get this result when a particular vulnera- 
bility’s original source exists on the local system and 
the settings of local system match all the environment 
factors. In this case, we will conclude that the vulnera- 
bility exists on the system. 


Case 2: Vulnerability factors: match & Environment 
factors: no match 


This occurs when we can detect the origin of the
vulnerability on the local system, but the settings
of the local system do not match the environment
factors. So the vulnerability is not applicable but it has
the potential to affect the system if the environment
changes. For example, consider the case of “Apache 
Web Server Chunk Handling Vulnerability” [14]. Even 
if Apache is installed, we will not be affected by the
vulnerability as long as we do not provide HTTP services.


Although this second case appears to be an 
exception, it is actually not uncommon as a full instal- 
lation of the operating system and application pro- 
grams may have been done. Hence, many installed 
components in the system may not usually be in use. 


Case 3: Vulnerability factors: no match & Environ- 
ment factors: match 


In this case, the vulnerability would appear to be 
not applicable. However, there is a subtle issue. Con- 
sider the case of OpenSSL (an open source implemen- 
tation of the SSL protocol) which had several stack 
overflow vulnerabilities which are exploitable [15]. 
OpenSSL may not be installed as an individual compo- 
nent, so even if there is a database entry for the 
OpenSSL vulnerability, this would return a negative 
result in terms of vulnerability data factors. However, 
OpenSSL is commonly included in applications such as 
Apache, Sendmail, Bind, Linux and Unix based sys- 
tems. Thus, it is necessary to check for the existence of 
such applications which may indicate that such an 
OpenSSL vulnerability exists even if OpenSSL is itself 
not detected. This highlights that one may need several 
database entries corresponding to a vulnerability given 
some of these indirect potential factors. 


Case 4: Vulnerability factors: no match & Environ- 
ment factors: no match 


The vulnerability does not exist on the local sys- 
tem. 
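
The four cases can be summarized as a small decision function. The
following is a minimal Python sketch for illustration only; it is not
taken from the Movtraq prototype, and the names Verdict and classify
are hypothetical.

```python
from enum import Enum


class Verdict(Enum):
    """Outcome of matching one database entry against the local system."""
    VULNERABLE = "Case 1: vulnerability exists on the system"
    POTENTIAL = "Case 2: not applicable now, may apply if the environment changes"
    INDIRECT = "Case 3: check applications that may embed the component"
    NOT_PRESENT = "Case 4: vulnerability does not exist on the system"


def classify(vuln_match: bool, env_match: bool) -> Verdict:
    """Map the two match results onto the four cases described above."""
    if vuln_match and env_match:
        return Verdict.VULNERABLE
    if vuln_match:
        return Verdict.POTENTIAL
    if env_match:
        return Verdict.INDIRECT
    return Verdict.NOT_PRESENT


# Apache installed (vulnerability factors match) but no HTTP service offered.
print(classify(vuln_match=True, env_match=False).value)
```

Case 3 is deliberately kept distinct from Case 4, since a negative
match on the component itself may still hide an embedded copy.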


Vulnerability Impact (Consequences) 


The third category of data concerns the impact of a
vulnerability, which describes the possible consequences
of the vulnerability if it is successfully exploited.
In our database, this is stored as a vulnerability
description expression which is machine processable and
describes the vulnerability impact in a precise and
concise form. There is no need to use any taxonomy or
qualitative impact factor (e.g., critical, high, medium,
low), which is not precise and may not make sense in
the context of a particular system. It also enables
checking of the relationship between different
vulnerabilities and whether they can affect one another.

Figure 2: Vulnerability database and scanner.


Database Structure 


As we have argued, the exact structure of the
database is not so important. What matters is the content,
and having it in a more precise, machine processable
format. In our proof-of-concept design, the database
has seven main entities, namely:

• Vulnerability Entity: names the vulnerability and links
  it to the specification of the vulnerability and
  environment.

• Vulnerability Specifications Entity: collects the
  vulnerability factors and the impact.

• Environment Specifications Entity: collects the
  environmental factors.

• Operating System Entity: vulnerability requirements
  originating from the operating system.

• Application Entity: vulnerability requirements specific
  to an application.

• Services Entity: vulnerability requirements specific to
  a service.

• Exploit Entity: details of exploits and impact.


An entity relationship diagram which gives an 
overview of the relationship between these data items 
is given in Figure 3. 

We will briefly mention some of the key fields 
from an integration and machine processable perspec- 
tive. We have mainly omitted fields in the general 
information category which are present in the database 
for human consumption. 

• Vulnerability Entity: a textual description for the
  vulnerability, identifiers such as CERT ID, BugTraq ID,
  CVE ID and also other keys corresponding to other tables.

• Vulnerability Specifications Entity: vulnerability
  consequences*, hardware requirements, name of vulnerable
  application/service*. A service could be a daemon.

• Environment Specifications Entity: existence of required
  user/application or service object* which may be
  exploitable, existence of a file object*, remote
  exploitation flag, application/services environment,
  hardware requirements.

• Application/Services Entity: name of application/service,
  application/service ID, vulnerable versions, hardware
  requirements. Services have additional fields like
  protocols, port numbers, etc.

• Operating System Entity: similar to application entity
  but for the operating system.

• Exploit Entity: actual exploit (could be a URL, filename,
  etc.), privileges needed*, consequences of using the
  exploit*.

Figure 3: Vulnerability database structure.


The fields labeled with (*) make use of the vulnerability
description expressions or vulnerability target objects
from the next section. Note that some fields with a
similar function occur several times in different
contexts. For example, hardware requirements may differ
between the application and the environment, and there
are two different consequences: one from the
vulnerability and one from using a specific exploit.
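
To make the relationships between the entities more concrete, the
following Python sketch models a reduced version of the structure.
It is an illustration under stated assumptions, not the schema of the
prototype: all class and field names are hypothetical, and only a few
of the fields listed above are included. The sample values are taken
from the OpenSSL entry (CAN-2002-0656) discussed in this paper.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Exploit:
    location: str            # URL, filename, etc.
    privileges_needed: str   # vulnerability description expression (*)
    consequences: str        # vulnerability description expression (*)


@dataclass
class VulnerabilitySpec:
    consequences: str        # expression (*), e.g. "@G u#L"
    vulnerable_target: str   # name of vulnerable application/service (*)


@dataclass
class EnvironmentSpec:
    required_user: Optional[str]   # user object (*), e.g. "u#R"
    required_file: Optional[str]   # file object (*), or None
    remote: bool                   # remote exploitation flag


@dataclass
class Vulnerability:
    description: str
    cve_id: str
    bugtraq_id: str
    spec: VulnerabilitySpec
    environment: EnvironmentSpec
    exploits: List[Exploit] = field(default_factory=list)


openssl = Vulnerability(
    description="Multiple Vulnerabilities In OpenSSL",
    cve_id="CAN-2002-0656",
    bugtraq_id="5363",
    spec=VulnerabilitySpec("@X c#(u&App); @G u#L",
                           "Various Apache versions and OpenSSL-based applications"),
    environment=EnvironmentSpec(required_user="u#R", required_file=None, remote=True),
)
print(openssl.cve_id, openssl.environment.remote)
```

Keeping the exploit consequences separate from the vulnerability
consequences mirrors the two different "consequences" fields noted
above.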


Integrating the Data 


One of the difficulties in dealing with
security/vulnerability alerts is the need to integrate
information from multiple sources. Our prototype database
is no exception and was built by integrating data from
multiple vulnerability sources such as CERT, BugTraq,
CVE, vendors and software developer sites. Ideally, one
would prefer a single source for the vulnerability
information (even if it is only in text form). However,
the reality is that, due to the distributed handling and
speed of dealing with vulnerabilities, one has to accept
that integration may be required.


The following example, which is the “OpenSSL 
SSLv2 Malformed Client Key Remote Buffer Over- 
flow Vulnerability,” illustrates the need for integra- 
tion. It has a CVE ID of CAN-2002-0656 [15]. 


BugTraq from SecurityFocus provides:

    BugTraq ID: 5363
    Application environment: Apache v1.0 - 1.3.26
    OS environment: Linux, Microsoft Windows
    Proof of concept exploit: available
    Minimum user rights for exploit: u#R¹


CERT vulnerability advisory provides:

    CERT ID: CA-2002-23
    Vulnerable application version: OpenSSL prior to 0.9.6
    Vulnerability impact: @G u#S


Vendor/software information:

    From OpenSSL (www.openssl.org) we get the
    vulnerable application range as: 0.9.1c - 0.9.5a.
    From the Apache documentation we know that usually
    the user is root.


In general, determining the complete environ- 
mental requirements and the consequences of the vul- 
nerability from the textual descriptions can be a 
tedious and time consuming process. This is one ratio- 
nale for a better system such as the one described here. 
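
The integration step for this example can be sketched as a merge of
partial records, one per source. This is an illustrative Python
sketch only: the dictionary keys and the integrate function are
hypothetical, and the prototype database was assembled manually.
Values that differ between sources are kept side by side rather than
silently overwritten.

```python
from typing import Dict, List


def integrate(*sources: Dict[str, str]) -> Dict[str, List[str]]:
    """Merge partial records; every distinct value per field is retained."""
    merged: Dict[str, List[str]] = {}
    for source in sources:
        for key, value in source.items():
            values = merged.setdefault(key, [])
            if value not in values:
                values.append(value)
    return merged


# Partial data for CAN-2002-0656 as quoted in the example above.
bugtraq = {
    "bugtraq_id": "5363",
    "app_environment": "Apache v1.0 - 1.3.26",
    "os_environment": "Linux, Microsoft Windows",
    "min_user_rights": "u#R",
}
cert = {
    "cert_id": "CA-2002-23",
    "vulnerable_versions": "OpenSSL prior to 0.9.6",
    "impact": "@G u#S",
}
vendor = {"vulnerable_versions": "0.9.1c - 0.9.5a"}

record = integrate(bugtraq, cert, vendor)
for key, values in record.items():
    print(key, values)
```

In this sketch the field vulnerable_versions ends up with two
entries, which shows where a human still has to reconcile the
sources.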


Vulnerability Description Expressions 


The main machine oriented data fields in the 
database belong to three categories: system compo- 
nents of the vulnerability; environment factors of the 
vulnerability; and consequences of the vulnerability. 


The first category for various system compo- 
nents is usually specified as versions of the operating 
systems and applications. This can be straightfor- 
wardly encoded in the database. The other two cate- 
gories require a machine friendly specification. 


After studying 943 vulnerability notes from the
CERT advisory database, we found that most of the
information for these two categories can be described
effectively using the vulnerability description
expression described below. These expressions are
inspired by the rule language in the KuangPlus system [16].


An expression is written with the syntax:

    <VulnerabilityExpression> = <TargetObject> |
                                <Action> <TargetObject>


An action is written prefixed by ‘@’. Table 1 illustrates
the actions and the types of the corresponding target
objects.

Syntax       Semantics
-----------  ---------------------------------------------------------
@R, [...]    Read | Write (file object f | memory object m)
@A           Access (read and write) (file object f | memory object m)
[...]        Create (file object f)
@K (f|m)     Corrupt (file object f | memory object m)
@X           Execute (file object f | code object c)
@S, @D       Crash | Disrupt (node object n | application object a)
[...]        [...] object a | service object s)
[...]        Use (resource object r)
@E (r)       Exhaust (resource object r)

Table 1: Actions in vulnerability description expressions.

¹This is a vulnerability description expression to describe
the impact, see the next section.















Rather than giving a formal definition of target
objects, we have listed examples of target objects in
Table 2. In Table 2 the following prefixes are used: ‘%’ is
used to denote an actual value; ‘#’ is used to denote a
symbolic value; and ‘&’ is used for expressing users/
groups associated with an application/service.


As our proof-of-concept implementation is for Unix 
systems, the examples and objects are also Unix based. 
Vulnerabilities for other operating systems may require 
extension to the types of target objects and actions. 


Examples using Vulnerability Expressions 


The following examples use the expressions to
describe various vulnerability consequences.²

• @D n#N: Denial of Service for the whole network
  (Ref: Cisco IOS Interface Blocked by IPv4 Packet,
  CERT ID VU#411332).

• @G u#S: Gain superuser rights (Ref: Linux Kernel
  Privileged Process Hijacking Vulnerability,
  Bugtraq ID: 7112).

• @G u#R: Gain remote user rights (Ref: Apache htpasswd
  Password Entropy Weakness, Bugtraq ID: 8707).

• @S a#mailman; @D a#mailman: Crash mailman application,
  and deny its service (Ref: Red Hat Linux GNU Mailman
  Remote Denial Of Service Vulnerability,
  Bugtraq ID: 10147).

• @R f%/etc/passwd: Read file /etc/passwd.

• @X f#*(4777): Execute a file with setuid permission.
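
Because the expressions have a regular shape, they are easy to
process mechanically. The following Python sketch parses the
examples above into their parts. It is an illustration, not the
parser used by the prototype: the Expression type, the parse
function and the regular expression are assumptions based only on
the syntax shown in this section (an optional ‘@’ action letter, a
one-letter object type, one of the prefixes ‘#’, ‘%’ or ‘&’, and a
value), with semicolons separating multiple expressions.

```python
import re
from typing import List, NamedTuple, Optional


class Expression(NamedTuple):
    action: Optional[str]   # letter after '@', or None for a bare target object
    obj_type: str           # u, g, f, m, n, c, a, s, r, ...
    prefix: str             # '#' symbolic, '%' actual value, '&' associated
    value: str


_PATTERN = re.compile(
    r"^(?:@(?P<action>[A-Z])\s+)?(?P<type>[a-z])(?P<prefix>[#%&])(?P<value>.+)$"
)


def parse(text: str) -> List[Expression]:
    """Split on ';' and parse each vulnerability description expression."""
    result = []
    for part in text.split(";"):
        part = part.strip()
        if not part:
            continue
        match = _PATTERN.match(part)
        if match is None:
            raise ValueError(f"not a vulnerability expression: {part!r}")
        result.append(Expression(match["action"], match["type"],
                                 match["prefix"], match["value"].strip()))
    return result


print(parse("@S a#mailman; @D a#mailman"))
print(parse("@R f%/etc/passwd"))
```

A nested object such as c#(u#S) is returned with the value "(u#S)";
a fuller parser would recurse into the parentheses.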


²For simplicity, multiple expressions are separated by
a semicolon.

Syntax          Semantics
--------------  ------------------------------------------------------
(u): User Objects
[...]           [...]
u#P             Physical user
[...]           [...] that of the current user
u&App           User running corresponding application process
u&Svc           User running corresponding service (i.e., daemon)
u&Kernel        User who can access or control the OS kernel
(g): Group Objects
[...]           [...] application process/service
(f): File Objects
f#*             All files
f#passwd        Pathname corresponding to the passwd file
f#shell         Pathname corresponding to shell files, e.g., “/bin/bash”
f#system        Pathname corresponding to system files in the OS
f#*(4777)       All files with permission 4777
f#F             Files beyond current user access rights
f%/etc/passwd   The file “/etc/passwd”
f&App           File associated to the running application process
(m): Memory Objects
m#M             Memory area beyond the current user’s access right
(n): Node Objects
n#L             Scanned node where an application program is installed
                and/or related service is running
n#N             Network
n%IP            Node at IP address (may be a range)
(c): Code Objects
c#(<u>)         Piece of code with the execution privilege of the user
                object u, e.g., privilege escalation

Table 2: Objects in vulnerability description expressions.


The following are examples of portions of the
machine oriented fields in the database for several
vulnerabilities:


• MySQL Password Handler Buffer Overflow Vulnerability:

    CVE_ID: CAN-2003-0780
    Bugtraq_ID: 8590
    Vul_Con: @X c#(u%mysql); @G u%mysql; @G u#L
    Vul_OS: null
    Vul_App: Various Mysql versions
    Env_User: u#L
    Env_File: null
    Env_Rem: No
    Exploit: No
    Env_OS: null
    Env_App: Mysql

• Linux Kernel IOPERM System Call IO Port Access
  Vulnerability:

    CVE_ID: CAN-2003-0246
    Bugtraq_ID: 7600
    Vul_Con: @A f#F
    Vul_OS: Various Linux distributions
    Vul_App: null
    Env_User: u#L
    Env_File: null
    Env_Rem: No
    Exploit: No
    Env_OS: Linux kernel 2.4.0 - 2.4.21, 2.5.0 - 2.5.69
    Env_App: null

• Linux 2.4 Kernel execve Race Condition Vulnerability:

    CVE_ID: CAN-2003-0462
    Bugtraq_ID: 8042
    Vul_Con: @A f#F; @X c#(u#S); @G u#S
    Vul_OS: Various Linux distributions
    Vul_App: null
    Env_User: u#L
    Env_File: f#*(4111)
    Env_Rem: No
    Exploit: Yes
    Env_OS: Linux Kernel 2.4.0 - 2.4.21
    Env_App: null

• Multiple Vulnerabilities In OpenSSL:

    CVE_ID: CAN-2002-0656
    Bugtraq_ID: 5363
    Vul_Con: @X c#(u&App); @G u#L
    Vul_OS: null
    Vul_App: Various Apache versions and
             OpenSSL-based applications
    Env_User: u#R
    Env_File: null
    Env_Rem: Yes
    Exploit: Yes
    Env_OS: null
    Env_App: Corresponding service provided
             by the vulnerable application


Translation Issues 


From our experiments in translating text-based
vulnerabilities into vulnerability expressions, we
encountered the following issues:

• The vulnerability description in the database sources
  is sometimes rather vague. Some examples are: “could
  expose sensitive information to local attackers”
  (Bugtraq ID 8233), “gain access to sensitive
  information” (Bugtraq ID 9558), or “leads to
  unauthorized access to attacker-specified resources”
  (Bugtraq ID 9778). We require a more specific
  consequence, which means either describing it in a
  catch-all fashion or doing much more work to
  understand the vulnerability.

• Our vulnerability expression language is designed to
  capture general expressions at the OS level. It does
  not express various application specific descriptions,
  such as: “to access variables outside the Safe
  compartment” (Perl, Bugtraq ID 6111), or “could
  compromise the private keys of ElGamal signing key
  implementation” (GnuPG, Bugtraq ID 9115). Such
  consequences are approximated by translation into the
  closest vulnerability expressions capturable by our
  language. The two examples above can be rewritten as
  access of memory and of files beyond the current
  user’s right, respectively.

• Some vulnerability entries, particularly those of
  CAN(didate) type, are listed as “unknown consequence”
  (e.g., Bugtraq ID 10428). Hence, we either have to
  ignore such entries for the moment, or use a special
  form to indicate unknown consequences.


Movtraq Scanning Robot


To demonstrate the use of the Movtraq database,
we have implemented a prototype automatic vulnerability
scanner (called the Movtraq scanning robot). The robot
runs on two different versions of Unix, Red Hat Linux
and FreeBSD, to demonstrate a degree of platform
independence.


The overall structure of the robot together with
the database is depicted in Figure 2. The integrated
Movtraq database is stored in MySQL. The scanner
consists of a local system configuration collector
which collects information about the applications, the
operating system (which processes are running, which
ports are open, hardware details, etc.) and the services
on the system. Software versions are obtained by using
the rpm utility on Red Hat and the pkg_info utility on
FreeBSD. The scanner is written in Perl and queries the
MySQL Movtraq database using SQL.
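
Once an installed version has been collected, it must be compared
with the vulnerable ranges stored in the database, such as “2.4.0 -
2.4.21, 2.5.0 - 2.5.69”. The following Python sketch shows one way
to do this comparison. It is an assumption-laden illustration, not
the Perl code of the robot: the function names are hypothetical, it
handles only purely numeric dotted versions (not suffixes such as
“0.9.5a”), and it does not itself invoke rpm or pkg_info.

```python
from typing import Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '2.4.21' into (2, 4, 21) so versions compare numerically."""
    return tuple(int(part) for part in text.strip().split("."))


def in_ranges(installed: str, ranges: str) -> bool:
    """Check an installed version against 'low - high, low - high, ...'."""
    version = parse_version(installed)
    for rng in ranges.split(","):
        low, high = (parse_version(v) for v in rng.split(" - "))
        if low <= version <= high:
            return True
    return False


env_os = "2.4.0 - 2.4.21, 2.5.0 - 2.5.69"
print(in_ranges("2.4.20", env_os))
print(in_ranges("2.6.0", env_os))
```

Comparing tuples of integers avoids the pitfall of string comparison,
where “2.4.9” would sort after “2.4.21”.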


The robot has three basic scanning options:

• Vulnerability checking: checks if the system is
  vulnerable to the vulnerabilities specified in the
  database (a Case 1 vulnerability).

• Potential vulnerability checking: checks for software
  vulnerabilities which exist but to which the system is
  not currently vulnerable for environmental reasons (a
  Case 2 vulnerability). This can be useful since the
  system may become vulnerable later, e.g., if a service
  which was off is turned on.

• Vulnerability with exploit checking: enhances
  vulnerability checking to see if the listed exploits
  are directly applicable. This adds the constraints of
  the exploits into the checking process.


    1. Apache Mod_Auth_Any Remote Command
       Execution Vulnerability
       Application version check: positive
       Service port check: negative
       Conclusion: source application is detected,
       default port required is not open,
       potential vulnerability exists but does not
       affect current system configuration

    2. Sun One/iPlanet Web Server Vulnerability to DOS
       Application version check: n/a
       Conclusion: source application not detected,
       safe from vulnerability

    3. Linux Kernel IOPERM System Call
       IO Port Access Vulnerability
       OS version check: positive
       OS environment check: positive
       Conclusion: vulnerability detected!

    4. MySQL Password Handler Buffer Overflow
       Vulnerability
       Application version check: positive
       OS environment check: skipped
       Conclusion: vulnerability detected!

Listing 1: Sample scanner log.


An abbreviated sample log from running the 
scanner illustrates how application, version and envi- 
ronmental checking is performed; see Listing 1. Only 
some of the pertinent checks from the log are shown 
to illustrate the following points: 

• Example 1: Apache vulnerability exists but the
  environment check fails since the required port is
  not open.

• Example 2: no vulnerability since the application is
  not installed.

• Example 3: an OS vulnerability, so only OS checking
  is used.

• Example 4: vulnerability inherent to the MySQL
  version; OS environment checking is skipped as it is
  not required.


Vulnerability Chaining Analysis 


An interesting use of the scanner is that it can be 
used to test if existing vulnerabilities can be combined 
together (chaining) to create more vulnerabilities. This 
mimics what a hacker might do to take advantage of 
indirect weaknesses on the system. 


Consider the following example which is typical 
of a privilege escalation attack. Suppose the system 
has the following two vulnerabilities: 

    Name: Buffer Management Vulnerability in OpenSSH
    Vul_ID: 57
    CVE_ID: CAN-2003-0693      Bugtraq_ID: 8628
    Vul_Con: @G u#L
    Vul_OS: null               Vul_App: Openssh apps
    Env_Usr: u#R               Env_File: null
    Env_Rem: Yes               Exploit: No
    Env_OS: null
    Env_App: Service provided by the vulnerable app

    Name: Linux 2.4 Kernel execve Race Condition Vulnerability
    Vul_ID: 48
    CVE_ID: CAN-2003-0462      Bugtraq_ID: 8042
    Vul_Con: @G u#S
    Vul_OS: Linux              Vul_App: null
    Env_Usr: u#L               Env_File: f#*(4111)
    Env_Rem: No                Exploit: Yes
    Env_OS: Linux kernel 2.4.0 - 2.4.21
    Env_App: null


In this example, the scanner discovers that both
vulnerability 48 and 57 are present. From Vul_ID: 57,
a remote user (u#R) can gain local rights (@G u#L),
and this chains onto Vul_ID: 48, which has a local
environment requirement (local user: u#L and setuid
executable file: f#*(4111)). Thus it discovers that a
remote user may be able to exploit the two
vulnerabilities to gain local root access.


Chaining analysis illustrates the benefit of a machine 
oriented approach and the use of vulnerability expressions 
to analyse relationships between vulnerabilities. 
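
The chaining step for the two entries above can be sketched as a
simple fixed-point computation over user rights. This Python sketch
is an illustration under simplifying assumptions and is not the
algorithm of the prototype: the Vuln type and chain function are
hypothetical, only the Env_Usr requirement and the @G consequence
are modeled, and other constraints such as Env_File are ignored.

```python
from typing import List, NamedTuple, Set, Tuple


class Vuln(NamedTuple):
    vul_id: int
    requires: str   # Env_Usr: user rights needed to exploit
    gains: str      # target of the @G action in Vul_Con


def chain(start: str, vulns: List[Vuln]) -> Tuple[List[int], Set[str]]:
    """Repeatedly apply vulnerabilities whose requirement is already met."""
    rights = {start}
    path: List[int] = []
    changed = True
    while changed:
        changed = False
        for v in vulns:
            if v.requires in rights and v.gains not in rights:
                rights.add(v.gains)
                path.append(v.vul_id)
                changed = True
    return path, rights


found = [Vuln(48, requires="u#L", gains="u#S"),
         Vuln(57, requires="u#R", gains="u#L")]
path, rights = chain("u#R", found)
print(path, "u#S" in rights)
```

Starting from a remote user, the sketch applies entry 57 and then
entry 48, reproducing the escalation to superuser described above.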


Operating System and Local Configuration Mapping


Because environmental and application vulnera- 
bility data are expressed as vulnerability expressions, 
these abstractions may need to be further refined. In 
the context of a particular local system configuration, 
operating system distribution, etc., additional localiza- 
tion may be needed to map the abstractions to concrete 
objects. One may choose to have additional databases 
to do this mapping from vulnerability target objects to 
the actual objects on the system. Our robot prototype
does not do this since it has been tested only on
Red Hat and FreeBSD.


Deployment Strategies for Movtraq


The prototype Movtraq system is sufficiently
useful to be deployed in a number of ways. Some of
the potential scenarios depicted in Figure 4 are:

• Scenario 1: Local vulnerability database, local
  client. Here, each local machine hosts its own
  database. The Movtraq database is meant to have been
  downloaded (securely) from another server. This has
  the advantage that the database is local and thus all
  operations can be done locally. The disadvantage is
  that an up-to-date database has to be maintained on
  every host.

• Scenario 2: Organization-wide database, local client.
  This simply extends scenario 1 to an organizational
  context where there is an organization-wide database
  server. Where multiple machines have exactly the same
  configuration, one may choose to check only a subset
  of the machines.

• Scenario 3: Internet-based database, local client.
  Lastly, as in automatic update systems, a database
  server somewhere on the internet serves as the
  database repository.

Figure 4: Deployment options for Movtraq.


These strategies are suitable for our Movtraq
proof-of-concept system, but one could have more
general systems. For example, one could have a scanner
which is partially local and partially remote. This
may be useful in an organizational context where any
system configuration changes are registered with a
separate non-local configuration database. Any security
alerts are then checked externally against this
configuration database.


Discussion 


We believe that there is a real need for vulnerability
databases which integrate the necessary pieces of
information for evaluating the impact of any new
vulnerability and allow the appropriate action to be
taken automatically. Furthermore, in order to be
timely, we argue that the vulnerability evaluation
process should not be dependent on having humans
process alerts. This does not mean that we advocate
not having humans in the loop at all, but rather that
the loop should not be dependent on the speed of a human
response. Thus, it is important that there be not only
a human readable vulnerability database but also one
which is geared for automatic processing by machines.
As far as we are aware, the existing systems for
disseminating alerts are still primarily human oriented,
as are the key source databases.


We have demonstrated a proof-of-concept data- 
base which allows effective integration of data from 
multiple sources and can be used directly by an auto- 
matic vulnerability scanner. In the workshop report on 
security vulnerability databases [17], it was remarked 
that some of the difficult issues are to do with terminol- 
ogy and the schema of the database. Our database 
design uses both abstraction and separation of exploits 
from vulnerabilities — both of which are highlighted in 
the report. In particular, the use of abstraction, which for 
us is how the database caters for automated analysis and 
machine processing, simplifies the issue of terminology 
and taxonomy. This is a plus point since these are often 
controversial from a textual description viewpoint. 


The database described here is meant to be a 
proof-of-concept system and is not necessarily com- 
prehensive. However, the prototype scanner demon- 
strates that we capture the essential elements of a 
machine-oriented database. As this prototype was 
designed for Unix systems, for other operating sys- 
tems, such as Microsoft Windows, both the database 
and vulnerability expressions may need to be 
enhanced. However, the fundamental concepts in the 
design should still be applicable. 


Finally, our proposal also addresses a number of 
important practical issues: 

• Integration of Vulnerability Information: An
  integrated database is ideal but may not be practical
  given that many separate parties are involved in
  putting together the requisite information. However,
  it is fairly simple as an additional step to put out
  the information in the kind of machine oriented form
  we have advocated and also to concentrate on the
  relevant data from a machine perspective. In our
  prototype, we have only built a small integrated
  database since it is rather time consuming to do so
  manually from scratch using the existing data sources.
  However, once vulnerability information is
  disseminated in the right format, integration becomes
  significantly easier.

• Verifiable Vulnerability Processing: It is certainly
  the case that any automatic update or scanning system
  would be welcomed by system administrators. However,
  unless one can deal with the privacy and trust issues,
  there are significant downsides to the use of such
  systems. Again, an integrated machine oriented
  database such as Movtraq allows decoupling of the
  information from the processing and, as it is simply a
  database, it can be subject to verification.


Further work would involve convenient GUIs,
fully featured implementation, Windows compatibility,
and a more sophisticated vulnerability model.


ABSTRACT


This paper presents a Linux kernel module, DigSig, which helps system administrators
control Executable and Linkable Format (ELF) binary execution and library loading based on the
presence of a valid digital signature. By preventing attackers from replacing libraries and sensitive,
privileged system daemons with malicious code, DigSig increases the difficulty of hiding illicit
activities such as access to compromised systems.


DigSig provides system administrators with an efficient tool which mitigates the risk of
running malicious code at run time. This tool adds extra functionality previously unavailable for
the Linux operating system: kernel level RSA signature verification with caching and revocation
of signatures.


Introduction 


In recent years, the economic impact of malware such
as viruses and worms has steadily increased. Even
though much malware targets Windows, the increasing
popularity of Linux as a desktop platform and its wide
use as a public server mean that the risk of seeing
viruses or Trojans developed for this platform is
rapidly growing.


Malware can be installed on the system through
different sources. On desktop systems, a major source
of malware lies in careless users who introduce viruses,
worms, Trojans, or other nuisances through email
attachments or internet downloads of Trojaned software.


On the server side, vulnerabilities like buffer
overflows in public services are very often exploited
by the attacker to install rootkits which replace system
binaries and libraries with Trojaned versions. These
rootkits are then used to ensure continued access to a
compromised machine and to mask the attacker’s illicit
activity. For instance, the Remote Shell Trojan (RST)
infects all ELF binaries of the /bin directory, offering
a backdoor process with a command shell at the privilege
level of the invoking user.


Even though malware spreads from different origins,
the final result is often the same: an unauthorized
binary running on the system.


To mitigate this risk, system administrators commonly
deploy restrictive solutions such as firewalls, virus
scanners, or intrusion detection tools. Although those
tools do have a positive impact on system security,
they have proven to be insufficient; a firewall, for
instance, is usually incapable of detecting covert
channels. Papers such as [1] have already raised the
alarm, and [2] even compares firewalls to the French
Maginot line.¹ Virus scanners and intrusion detection
systems also show several limits, such as their
inability to detect totally new viruses or intrusions.
Indeed, their detection engine usually relies on an
extensive signature database, sometimes enhanced with a
heuristic algorithm to detect known viruses/intrusions
and their close cousins. This results in a time gap
between the first spread of new viruses and their
characterization through signatures. This gap can be
used by attackers to penetrate the internal network of
the company. In theory, only a few systems based on
users’ normal behavior or a misuse detection model are
capable of detecting brand new viruses; however, they
are still at the research stage [5].


Supporting the concept of defense in depth, this
paper proposes DigSig as a new layer of defense in
addition to existing mechanisms. It highlights how
DigSig, used alongside firewalls, virus scanners, or
IDS, can significantly increase security. DigSig does
not prevent malicious applications from being installed
on the system, but prevents their execution, which is
when they actually become dangerous.


The paper is organized as follows. First, we
describe how DigSig enforces digital signature
verification at ELF file loading time. Second, we explain
how system administrators would typically deploy
DigSig on one or several Linux hosts. Then, we focus
on the security aspects of this kernel module and in
what way it counters attacks. We analyze the performance
impacts of DigSig. Finally, we mention some related
work, including some complementary systems DigSig
might be added to.


¹For readers not familiar with the history of the Second
World War, the German army went around the main French
defense line, called the Maginot line, and defeated the
French army in 40 days!


DigSig Kernel Module 


DigSig is implemented as a Linux kernel mod- 
ule, which checks that loaded binaries and libraries 
contain a valid digital signature. In the case of an 
invalid signature for the binary or for any of its shared 
libraries, the execution is aborted. 


When an ELF file is to be loaded into an exe- 
cutable memory region, DigSig searches the file for a 
signature section. If no such section is available, load- 
ing is refused. Otherwise, DigSig hashes the contents 
of all text and data segments, and compares the result 
to the contents of the signature section decrypted with 
a system’s public key. If values do not match, this 
means the file was not signed with the corresponding 
private key, or that it was modified after signing. In 
such a situation, loading is refused. 
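
The load-time decision just described can be illustrated with a
short user-space Python sketch. It is a toy model under loudly
stated assumptions and is not DigSig's kernel code: the key is built
from two small known primes and is far too short for real use, SHA-1
is assumed as the hash, no signature padding is applied, and the
function names are hypothetical. It only shows the logic of hashing
the segments and comparing the result with the signature processed
with the public key.

```python
import hashlib
from typing import Optional

# Toy RSA key (assumption for illustration): two known Mersenne primes.
P = 2**61 - 1
Q = 2**31 - 1
N = P * Q
E = 65537                              # public exponent
D = pow(E, -1, (P - 1) * (Q - 1))      # private exponent, kept offline in practice


def digest(segments: bytes) -> int:
    """Hash the text and data segments and reduce the hash into the key range."""
    return int.from_bytes(hashlib.sha1(segments).digest(), "big") % N


def sign(segments: bytes) -> int:
    """Offline signing step, done with the private key."""
    return pow(digest(segments), D, N)


def may_load(segments: bytes, signature: Optional[int]) -> bool:
    """Load-time check: refuse if unsigned or if the hashes do not match."""
    if signature is None:              # no signature section in the file
        return False
    return pow(signature, E, N) == digest(segments)


original = b"text and data segments of a binary"
sig = sign(original)
print(may_load(original, sig))
print(may_load(original + b" modified", sig))
```

The second call models a file modified after signing, which is the
situation in which loading is refused.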


An attacker who now attempts to replace ps, ls,
sshd, libc, or any other common rootkit target with
Trojaned versions will find these files cannot be
executed.


DigSig adds a new section into the ELF binary;
therefore, it works only with binaries in ELF format.
As ELF is the predominant format in Linux systems,
compiling a kernel with only ELF support should not
cause problems. In this article, whenever binaries are
mentioned, only the case of ELF formatted binaries is
considered. At this time, DigSig also does not cover
the case of scripts. This is outside the scope of this
paper, but will be addressed in the future.


The following subsections detail the implementation
of DigSig.


Signing the Binary

Before the signature of a binary can be verified, the
binary needs to be signed and the signature stored so
that it can be retrieved. Executables and libraries are
initially signed offline, using the Debian userspace
package BSign [3]. BSign embeds an RSA signature of all
text and data segments into a new ELF segment called
the “signature” segment (see Figure 1). Once the files
are signed, the RSA private key is safely stored offline
and removed from the system, while the corresponding
public key is loaded in the DigSig kernel module.


Verifying the Signature of the Binary 


At execution time, the DigSig kernel module ver- 
ifies the signature of the binary/library. This requires 
the following functionality at kernel level: 

• Support for hash functions: The first step of each
  digital signature consists of hashing the data to
  sign. This is provided by the CryptoAPI, which is part
  of Linux 2.6.x mainstream kernels.

• Public key cryptography: This is in order to verify
  the signature.

• Executable file loading mediation: This allows us to
  verify a signature before running the binary, and
  optionally refuse execution.


The RSA algorithm is used for verifying the signa- 
ture of the binary. As there is currently no native imple- 
mentation of RSA at kernel level, we had to import our 
own implementation into the kernel. Yet, in cryptogra- 
phy, re-inventing the wheel often turns out to be 
extremely dangerous, so it was decided to use the well- 
tested, GPL’ed GPG implementation of RSA and 
port the necessary parts for use in kernel space. In order 
to avoid bloating in the kernel, only 10% of the original 
code of GPG has been imported. This code is currently 
isolated in a specific directory (digsig/gnupg) of DigSig. 





[Figure 1 shows the layout of a signed ELF file: text
segment, data segment, section header table, and the
BSign section, which embeds an RSA signature.]

Figure 1: BSign signature section as added in an ELF
binary.


As for mediating the loading of executables,
DigSig uses the Linux Security Modules (LSM)
architecture [6], which has now become an established
part of the kernel. It allows a module to define hooks 
to annotate kernel objects with security data and medi- 
ate access decisions for such objects. One such hook, 
security_file_mmap, is called whenever mmap is called 
to map a file to a memory region. This is done by 
sys_execve to load binaries, as well as by dlopen to 
load shared libraries. The mmap hook is consequently 
a convenient location for DigSig to mediate exe- 
cutable mapping of ELF binaries [7]. 


Caching and Revocation Lists 


In order to increase performance of signature 
verification, DigSig caches a list of binaries whose 
signatures have already been verified. The first time 
an ELF binary or library is loaded, its signature is verified. 
If the signature is correct, then the successful 
signature verification is cached. In subsequent loads, 
DigSig only checks the presence of this signature vali- 
dation in the cache. This results in a significant 
improvement in performance, as detailed later. If the 
file is later opened for writing, the security_inode_per- 
mission LSM hook will be triggered, to which DigSig 
will respond by removing its signature from the cache. 
The size of the signature verification cache can be 
specified at module load time, but defaults to 512 sig- 
nature verifications. 
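
The caching behaviour described above can be sketched in user-space Python. This is an illustrative model only, not DigSig's kernel code: the class name, the verify callback, and the oldest-first eviction policy are assumptions made for the example (the text states only the default capacity of 512 entries).

```python
from collections import OrderedDict


class SignatureCache:
    """Toy model of a bounded cache of already-verified files."""

    def __init__(self, verify, capacity=512):
        self._verify = verify          # hypothetical callback: file id -> bool
        self._capacity = capacity      # 512 is the default named in the text
        self._valid = OrderedDict()    # file ids whose signature was verified

    def may_execute(self, file_id):
        """Return True if the file may be mapped for execution."""
        if file_id in self._valid:     # later loads: cache lookup only
            return True
        if not self._verify(file_id):  # first load: full signature check
            return False
        if len(self._valid) >= self._capacity:
            self._valid.popitem(last=False)  # assumed policy: drop oldest entry
        self._valid[file_id] = True
        return True

    def opened_for_writing(self, file_id):
        """Forget the verification when the file is opened for writing."""
        self._valid.pop(file_id, None)
```

In this model a write never blocks; it only forces the next load to pay for a full verification again, which mirrors the trade-off discussed in the text.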


Figure 2: DigSig’s caching mechanism. (Diagram: when an ELF 
binary or library is executed, security_file_mmap is called. On 
first execution the signature is checked, and a valid signature is 
added to the cache of valid signatures, while an invalid signature 
marks a bad binary or library. On later executions DigSig checks 
that the file is in the cache. When the file is written, 
security_inode_permission is called and the signature is removed 
from the cache.) 


The introduction of a signature validation cache 
might seem like a step toward a binary whitelist. In 
particular, it might seem a simpler solution to elimi- 
nate digital signatures altogether, and keep only a 
whitelist of acceptable binaries and libraries. Files on 
the whitelist would be executable but not writable, and 
files not on the whitelist would not be executable. This 
approach would eliminate the need to verify crypto- 
graphic signatures on each file on the first load. How- 
ever, the DigSig approach has several advantages. 
First, files with cached signature validations that are 
not being executed can still be written to. Their signa- 
ture validation will merely be removed from the 
cache. Second, the signature validation cache need not 
be updated to install new libraries or executables. 
Therefore, software can be installed and upgraded on a 
live system, and a DigSig system in many ways acts as 
a normal Linux system. Third, the signature validation 
cache cannot be edited from user-space, preventing 
attacks against the cache that an editable whitelist 
might be subject to. Fourth, a DigSig system can use a 
relatively small cache, as its sole purpose is perfor- 
mance improvement. A pure whitelist system would 
need to keep entries for every binary and library. 


DigSig also implements a signature revocation 
list, initialized at startup and checked before each signature 
verification. When programs that were previously 
signed with the correct private key are later 
found to contain vulnerabilities, their signatures may be 
revoked by adding them to the revocation list. This is 
particularly convenient because it eliminates the need 
to generate a new key pair and re-sign all binaries and 
libraries on the system just for a few revocations. The 
revocation list is communicated to DigSig using the 
sysfs filesystem, by writing to the /sys/digsig/digsig_revoke 
file. 

DigSig therefore allows binaries to be updated easily. 
This is a major requirement for many systems, in which 
binaries are changed or added over the lifetime of the 
system. 


Deploying DigSig 


DigSig requires neither kernel patching nor re- 
compilation of the binaries. Therefore, there are no 
major changes in the system necessary to deploy 
DigSig. 

DigSig installation requires three initial steps: 

1. Generate an asymmetric key pair (for instance 
   using GPG). 

2. Sign all trusted binaries and libraries (with 
   BSign). 

3. Modify the system startup procedure to load 
   DigSig. 


From a deployment point of view, the first step 
raises the issue of key storage. Obviously the private 
key should be kept confidential. To this end, BSign 
suggests keeping it on a removable physical medium 
such as a floppy disk, a flash memory key, or 
a read-only CD-ROM. It is also a good idea to keep a 
hash of the public key, for instance using sha1sum, to 
check that an attacker has not replaced the real public key 
with another one. This approach is particularly convenient 
in centralized networks where users connect 
from remote terminals onto a single application server: 
DigSig only needs to be deployed on the server to 
secure execution of all users. However, in networks of 
individual PCs or laptops, a compromise needs to be 
made between having one key per host (good security, 
but perhaps a burden for administrators) and having a 
single shared key for all (security issue: if one host is 
compromised, all are). Such issues are not specific to 
DigSig, but general to PKI. 
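
The public-key check suggested above (recording a hash with sha1sum and comparing it later) can be sketched as follows. This is a minimal illustration using Python's standard hashlib module; the function names and the idea of passing the recorded digest as a string are assumptions of the example, not part of DigSig or BSign.

```python
import hashlib


def sha1_of_file(path, chunk_size=65536):
    """Return the hex SHA-1 digest of a file, as `sha1sum` would print it."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def public_key_unchanged(path, recorded_digest):
    """Compare the current digest with one recorded at key-generation time."""
    return sha1_of_file(path) == recorded_digest.strip().lower()
```

The recorded digest is only useful if it is stored somewhere an attacker cannot rewrite it, for example alongside the private key on removable media.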


As for the second step, signing all binaries and 
libraries can easily be done through the following 
command: 

bsign -s -v -I -i / -e /proc \ 
    -e /dev -e /boot -e /usr/X11R6/lib/modules 


It takes about an hour to sign all binaries of a Fedora 
Core 1 installation using BSign 0.4.5 on a Pentium 4 
2.2 GHz with 512 MB of RAM. Note that the revocation 
list system reduces the need for re-signing the 
whole system. This fact is particularly important for 
systems that wish to offer uninterrupted service. 

The third step only requires minor modification 
of the operating system’s startup files. Automatic 
loading of the DigSig kernel module can be achieved 
by adding DigSig to /etc/modules, and appropriate 
options in /etc/modutils. 


Security Considerations 


In this section, a short security analysis of 
DigSig is conducted to help system administrators 
understand under what circumstances DigSig does or 
does not increase security. This assists system admin- 
istrators in deciding the best approach in deployment. 


DigSig operates as a kernel module. It thus 
requires root privileges for loading and unloading the 
module, and assumes that the DigSig private key remains 
secret and that the integrity of its public key, root access 
to the system, and the Linux kernel itself are not compromised. 
For the rest of the study, these requirements are 
taken for granted; however, please note this is not 
always the case (see the section on related work). 


It is important to understand that DigSig has not been 
designed for vendors but rather for system administrators. 
System administrators have total control over 
what should, or should not, execute on the machines 
they administer. A vendor cannot lock a given machine 
to given software without the system administrator’s 
consent. In brief, DigSig is aimed at protection against 
attackers rather than at DRM or software version 
management. Its two major 
goals are the following: 

1. If a binary has been signed, no one can modify 
the binary without the modification being 
detected. 

2. Nobody can execute or load an ELF binary or 
library unless it has been signed. 


However, note DigSig cannot protect a system 
from vulnerabilities within legitimately installed and 
signed software. Let’s see how DigSig achieves secu- 
rity. First, we detail how DigSig prevents modification 
of ELF binaries. Second, we examine the case of 
libraries. Finally, we analyze the security of the signa- 
ture caching and revocation mechanisms. 


As described earlier, when a file with a cached sig- 
nature verification is opened for writing, the signature 
verification is removed from the cache. However, this 
does not protect files that are still being executed. Fortunately, 
a second protection comes from the Linux kernel 
itself: the kernel forbids executing a file that is 
opened for writing, and vice versa. This is accomplished 
by calling the deny_write_access(file) kernel function, 
and even the superuser is subject to such restrictions. 


Unfortunately, the same defense is not extended to 
shared libraries. Worse, the deny_write_access function 
is not exported to kernel modules. DigSig must there- 
fore implement its own protection for libraries, which it 
does very similarly to the kernel. It blocks any attempt 
to mmap a library with executable permission if that 
library is already open for writing. If the mmap succeeds, 
then DigSig increments a usage counter for the 
inode. So long as the usage counter shows the file to be 
in use as a library by some process, no one, including 
the superuser, may open the file for writing. 
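
The library protection just described amounts to two counters per inode that exclude each other. The following Python model is a sketch under stated assumptions: the class and method names are invented for illustration, and the real implementation lives in the kernel and marks the VFS inode rather than keeping dictionaries.

```python
class LibraryWriteGuard:
    """Toy model of per-inode counters for writers and executable mappings."""

    def __init__(self):
        self._writers = {}   # inode -> number of open-for-write handles
        self._mapped = {}    # inode -> number of executable mappings in use

    def open_for_write(self, inode):
        if self._mapped.get(inode, 0) > 0:
            return False     # in use as a library: refuse, even for root
        self._writers[inode] = self._writers.get(inode, 0) + 1
        return True

    def close_writer(self, inode):
        # Must only be called after a successful open_for_write().
        self._writers[inode] -= 1

    def mmap_exec(self, inode):
        if self._writers.get(inode, 0) > 0:
            return False     # open for writing: refuse the executable mapping
        self._mapped[inode] = self._mapped.get(inode, 0) + 1
        return True

    def munmap_exec(self, inode):
        # Must only be called after a successful mmap_exec().
        self._mapped[inode] -= 1
```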


Under some circumstances, these defenses may 
still not be sufficient. In particular, deny_write_access(file) 
and the DigSig shared library writer lock work by 
marking the VFS inode. They are therefore restricted 
to a single machine. An NFS mounted file being exe- 
cuted on one client, for instance, could be modified on 
the server or on any other client. To reduce this threat, 
DigSig does not cache signature verifications for NFS 
mounted files. However, this does not protect NFS 
mounted files while they are in use. 


As for the signature caching mechanism, one 
might fear it introduces a possible attack point. How- 
ever, since the cache is stored in kernel memory, user 
space programs cannot directly insert fraudulent sig- 
nature validations. Signatures may be added to the 
cache only through a non-exported function dsi_cache_ 
signature, which is only called in one place, when a 
signature has in fact been validated during dsi_file_ 
mmap. While a user-space application cannot directly 
inject fraudulent signature validations, a Trojan kernel 
module could of course do so by directly manipulating 
memory. However, a Trojan kernel module could also 
stop DigSig altogether [10], so this is not a valid argu- 
ment against signature caching. 


Finally, it is important to note that signature revocations 
open the possibility of denial of service. It is 
vital that an attacker not be able to add valid signa- 
tures to the revocation list. To ensure this, DigSig 
restricts access to the communication interface 
(/sys/digsig/digsig_revoke) to root, so that only root can provide 
revocation lists to DigSig. As a further precaution, 
revocation lists can only be appended to before DigSig 
begins enforcing; that is, before a public key is pro- 
vided. Care should therefore be taken to ensure that 
DigSig is enabled before the system is exposed to 
threats, such as before a network connection is 
enabled if the network is the primary means of attack. 
Additionally, the integrity of the collection of revocations 
must be guarded. For instance, they can be stored 
on a read-only medium (such as a CD-ROM), or simply 
signed with GPG. 
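
The rule that revocations can only be appended before enforcement begins can be modelled in a few lines. This Python sketch is illustrative only: the class, its method names, and the choice of raising PermissionError are assumptions, and it says nothing about the actual format written to the sysfs file.

```python
class RevocationList:
    """Toy model: revocations are accepted only before enforcement begins."""

    def __init__(self):
        self._revoked = set()
        self._enforcing = False

    def revoke(self, signature):
        if self._enforcing:
            raise PermissionError("revocation list is frozen once enforcing")
        self._revoked.add(signature)

    def load_public_key(self):
        # In the text, providing the public key is what starts enforcement.
        self._enforcing = True

    def is_revoked(self, signature):
        return signature in self._revoked
```

Freezing the list at that point means that an attacker who gains access later cannot deny service by revoking valid signatures on the running system.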


Performance 


Figure 3 presents DigSig’s execution-time overhead as a 
function of executable size. In the following, all measurements 
were made with a key size of 1024 bits for the RSA algorithm 
and with SHA-1 as the hash function. The overhead induced by 
DigSig grows linearly with the size of executables. 
However, the gradient is very small: only about 
0.0016 microseconds per byte. 


This is not believed to be critical because, unlike 
Windows operating systems, Unix systems have only a 
few very large executables. As a matter of fact, on a typical 
Debian Woody workstation only 1.8% of executables 
and libraries are larger than 512 KB. 


As DigSig’s signature verification is performed 
once, at the beginning of load time, it is important to 
note that its induced overhead naturally decreases for 
long-lived applications. Yet, on Unix systems, admin- 
istrators and users keep on executing small commands 
such as ls, cp and cd. In such cases, the cost of signature 
verification is amortized by DigSig’s signature 
cache (see the second section). 


                          real        user        sys 
Kernel without DigSig     0m0.004s    0m0.000s    0m0.001s 
DigSig without caching    0m0.041s    0m0.000s    0m0.038s 
DigSig with caching       0m0.004s    0m0.000s    0m0.002s 


Figure 4: Time required for “/bin/ls -Al”. 


The efficiency of the caching system is demonstrated 
by Figure 4. This figure displays the average 
execution time, in seconds, when running a typical 
ls -Al command 100 times, using the Unix time command. 
The benchmark was run on a Linux 2.6.6 kernel 
with a Pentium 4 2.2 GHz and 512 MB of RAM. As signature 
validation occurs in execve, DigSig’s overhead is 
expected to show up in system time (sys). The 
benchmark results clearly highlight the improvement: 
with caching, there is hardly any impact when DigSig is used. 

Finally, to provide a better insight into the actual 
impact of DigSig on real workloads, three kernel compiles 
were timed on a non-DigSig system, and three on 
a DigSig system. The tests were performed using a 
2.6.7 kernel on a Pentium 4 2.4 GHz with 512 MB of 
RAM. The kernel being compiled was a 2.6.4 kernel, 
and the same .config was used for each compile. Each 
compile was preceded by a “make clean”. Results are 
shown in Figure 5. The first execution time, both with 
and without DigSig, appears to reflect extra time 
needed to load the kernel source data files from disk. 


Kernel without DigSig 
    real          sys 
    19m21.890s    1m27.992s 
    19m 9.276s    1m26.584s 
    19m 9.464s    1m26.191s 
    19m 7.717s    1m25.799s 
Kernel with DigSig 
    real          sys 
    19m19.957s    1m28.541s 
    19m 7.485s    1m26.832s 
    19m 7.883s    1m26.549s 
    19m 6.494s    1m26.618s 


Figure 5: Time required for 2.6.4 kernel “make”. 
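
As a worked check of the Figure 5 numbers, the sys columns can be averaged with a few lines of Python. The parsing helper and variable names below are assumptions of this sketch. Only the values are taken from the figure, and the first run of each series is included even though it appears to contain extra disk-loading time.

```python
import re


def to_seconds(stamp):
    """Convert a `time`-style value such as '19m21.890s' to seconds."""
    minutes, seconds = re.fullmatch(r"(\d+)m\s*([\d.]+)s", stamp).groups()
    return int(minutes) * 60 + float(seconds)


def mean(values):
    return sum(values) / len(values)


# sys times copied from Figure 5
SYS_WITHOUT = ["1m27.992s", "1m26.584s", "1m26.191s", "1m25.799s"]
SYS_WITH = ["1m28.541s", "1m26.832s", "1m26.549s", "1m26.618s"]

base = mean([to_seconds(s) for s in SYS_WITHOUT])
digsig = mean([to_seconds(s) for s in SYS_WITH])
print(f"mean sys without DigSig: {base:.3f} s")
print(f"mean sys with DigSig:    {digsig:.3f} s")
print(f"relative difference:     {100 * (digsig - base) / base:.2f} %")
```

The difference in mean system time comes out well under one percent, which is in line with the claim that DigSig has little impact on this workload.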


Related Work 


This section presents a few related tools that more 
or less have the same goals as DigSig, but also supple- 
mentary work that can be used together with DigSig. 


As previously stated, from a security point of view, 
DigSig assumes the root account has not been compromised. 
In circumstances where this is unacceptable, there 
are ways to circumvent this requirement. 


Figure 3: DigSig overhead (ms) for the first load (without caching) per executable size (bytes). 


One solution, well known in the Linux domain, 
relies on SELinux [9]. This security-enhanced Linux 
proposes an implementation of Mandatory Access 
Control policies, where access control on objects is set 
according to their sensitivity and not necessarily by 
their owner. An immediate consequence of this model 
is that root becomes much less powerful. On “normal” 
Unix operating systems, root is a superuser with 
super powers. With SELinux, root does not necessarily 
have access to all objects; this limits risks in case 
of root compromise. Permissions are specified at a 
very fine-grained level and include access to files, 
shared memory, POSIX capabilities, and sockets, 
among others. SELinux could be used in addition to 
DigSig, to provide the needed secrecy and integrity 
guarantees of DigSig keys and revocation lists, even in 
the case of a root compromise. 


Another alternative relies on Trusted Computing 
solutions, such as the specifications provided by the 
Trusted Computing Group (TCG) [11]. In TCG, the normal 
PC architecture is enhanced with a small piece of security 
hardware called the Trusted Platform Module (TPM). In particular, 
the TPM offers protected storage of data, irremediably 
binding data to Platform Configuration Registers 
(PCRs). A secret stored on the TPM may only 
be retrieved by the TPM owner if the configuration of 
the PC (held in PCRs) has not changed. This offers two 
levels of protection in case of root compromise. First, 
the TPM owner and root are not necessarily the same 
person. Second, if the root account has been compromised, 
most of the time this is because a malicious 
application (a rootkit) has been installed. Fortunately, 
installing a rootkit impacts the PC’s configuration, so 
the TPM forbids retrieval of secrets it stores. Sample 
implementations of trusted computing on Linux can be 
found in [12, 13, 14]. 


SELinux and Trusted Computing largely encompass 
DigSig’s goals, and they offer practical 
opportunities to supplement DigSig in stricter 
security-sensitive situations. On the other hand, the disadvantages 
of such solutions lie in their complexity, and in the 
need for specific hardware in the case of trusted computing. 
In contrast, DigSig’s small size makes it 
easy to install, configure or re-use. 


DigSig may also be compared to similar tools 
such as PaX [15] and ExecShield [16]. It is important 
to note that PaX and ExecShield do not exactly have 
the same goal as DigSig; they attempt to prevent soft- 
ware exploits from being used to execute arbitrary 
code. This is done by placing strict limits on mmapped 
regions, and by using address space randomization. In 
brief, PaX and ExecShield protect the system against 
exploits of malicious code, while DigSig prevents 
malicious code from being executed. 


Tripwire is in some respects similar to DigSig. It 
maintains a signature database of all files on a machine, 
and notifies the administrator when some of them are 
modified (possibly replaced by Trojaned versions for 
instance). However there are two major differences 
from DigSig. First, Tripwire works at user level, not 
within the kernel. Second, it does not provide on-the- 
fly verification of file signatures. For instance, there is 
no way to trigger signature verification when a binary 
is executed. So, Tripwire could more accurately be 
compared to an off-line file integrity verification tool. 


Closer to home, there is a Linux kernel patch 
written by Greg Kroah-Hartman, from the IBM Linux 
Technology Center. It is a proof of concept implementing 
digital signatures for kernel modules. Although 
it does not check binaries and has no use for caching, it 
is complementary to DigSig in that the latter does not 
check Linux kernel modules. The patch modifies a file 
called module.c, which is responsible for kernel module 
handling. Unfortunately, LSM does not provide any 
hooks here. Overall, being a proof of concept, the 
patch has not been benchmarked and offers little 
flexibility. 


Availability 


DigSig is available from SourceForge at 
http://disec.sourceforge.net. It is available under the GNU 
General Public License version 2. BSign is available from the 
Debian project, at 
http://packages.debian.org/unstable/admin/bsign.html. 


Conclusion 


In this paper, a new tool, named DigSig, is pre- 
sented. DigSig answers the needs of system adminis- 
trators in terms of run-time security of Linux operat- 
ing systems. It focuses on preventing execution of 
malicious code (ELF binaries or libraries) by checking 
an embedded RSA signature for each file. The imple- 
mentation is based on LSM hooks and optimized with 
a signature caching and revocation mechanism. 


From a deployment point of view, the paper has 
shown that DigSig introduces little additional 
installation and management effort. In particular, the 
initial setup of the system, which consists of signing all 
valid binaries and libraries, can be launched once and 
for all by a single command. Future sporadic changes 
to the system do not require this step and are handled 
by DigSig’s signature revocation mechanism. From a 
security point of view, the paper has also argued that 
DigSig is safe under assumptions that are reasonable 
for most security environments. 


DigSig has also been benchmarked, and the results 
show a very small overhead at load time (a few 
nanoseconds per byte of executable size) and even 
less with the signature caching mechanism. It is therefore 
reasonable to conclude that DigSig should not noticeably 
affect a machine’s performance from an end-user point of view. 


Finally, the paper has presented some other 
related work. Some, such as SELinux and TCG, may 
be used in addition to DigSig to supplement it. Others, 
such as ExecShield and Tripwire, present similarities and 
differences that make them suitable for other situations. 
To our knowledge, at this time, DigSig is the 
only GPL’ed run-time executable signature verification 
tool integrated into the Linux kernel. 


In the future, we mainly hope to extend our work 
to protect Linux systems against malicious shell scripts. 


I³FS: An In-Kernel Integrity Checker and 
Intrusion Detection File System 


Swapnil Patil, Anand Kashyap, Gopalan Sivathanu, and Erez Zadok 
— Stony Brook University 


ABSTRACT 


Today, improving the security of computer systems has become an important and difficult 
problem. Attackers can seriously damage the integrity of systems. Attack detection is complex and 
time-consuming for system administrators, and it is becoming more so. Current integrity checkers 
and IDSs operate as user-mode utilities and they primarily perform scheduled checks. Such 
systems are less effective in detecting attacks that happen between scheduled checks. These user 
tools can be easily compromised if an attacker breaks into the system with administrator 
privileges. Moreover, these tools result in significant performance degradation during the checks. 


Our system, called I³FS, is an on-access integrity checking file system that compares the 
checksums of files in real-time. It uses cryptographic checksums to detect unauthorized 
modifications to files and performs necessary actions as configured. I³FS is a stackable file system 
which can be mounted over any underlying file system (like Ext3 or NFS). I³FS’s design improves 
over the open-source Tripwire system by enhancing the functionality, performance, scalability, and 
ease of use for administrators. We built a prototype of I³FS in Linux. Our performance evaluation 
shows an overhead of just 4% for normal user workloads. 


Introduction 


In the last few years, security advisory boards 
have observed an increase in the number of intrusion 
attacks on computer systems [2]. Broadly, these intru- 
sions can be categorized as network-based or host- 
based intrusions. Defense against network-based 
attacks involves increasing the perimeter security of 
the system to monitor the network environment, and 
setting up firewall rules to prevent unauthorized 
access. Host-based defenses are deployed within each 
system, to detect attack signatures or unauthorized 
access to resources. We developed a host-based sys- 
tem which performs integrity checking at the file sys- 
tem level. It detects unauthorized access, malicious 
file system activity, or system inconsistencies, and 
then triggers damage control in a timely manner. 


System administrators must stay alert to protect 
their systems against the effects of malicious intrusions. 
In this process, the administrators must first detect that 
an intrusion has occurred and that the system is in an 
inconsistent state. Second, they have to investigate the 
damage done by attackers, like data deletion, adding 
insecure Trojan programs, etc. Finally, they have to fix 
the vulnerabilities to avoid future attacks. These steps 
are often too difficult and hence machines are mostly 
re-installed and then reconfigured. Our work does not 
aim at preventing malicious intrusions, but offers a 
method of notifying administrators and restricting 
access once an intrusion has happened, so as to mini- 
mize the effects of attacks. Our system uses integrity 
checking to detect and identify the attacks on a host, 
and triggers damage control in a timely manner. 

In our approach, given that a host system has 
been compromised by an attack, we aim at limiting the 
damage caused by the attack. An attacker that has 
gained administrator privileges could potentially make 
changes to the system, like modifying system utilities 
(e.g., /bin files or daemon processes), adding back- 
doors or Trojans, changing file contents and attributes, 
accessing unauthorized files, etc. Such file system 
inconsistencies and intrusions can be detected using 
Tripwire [10, 9, 22]. Tripwire is one of the most popu- 
lar examples of user mode software that can detect file 
system inconsistencies using periodic integrity checks. 
There are three disadvantages of any such user-mode 
system: (1) it can be tampered with by an intruder; (2) 
it has significant performance overheads during the 
integrity checks; and (3) it does not detect intrusions 
in real-time. Our work uses the Tripwire model for the 
detection of changes in the state of the file system, but 
does not have these three disadvantages. This is 
because our integrity checking component is in the 
kernel. 


In this paper we describe an in-kernel approach 
to detect intrusions through integrity checks. We call 
our system I³FS (pronounced as i-cubed FS), which is 
an acronym for In-kernel Integrity checker and Intrusion 
detection File System. Our in-kernel system has 
two major advantages over the current user-land Tripwire. 
First, on discovering any failure in integrity 
check, I³FS immediately blocks access to the affected 
file and notifies the administrator. In contrast, Tripwire 
checks are scheduled by the administrator, which 
could leave a larger time-period open for multiple 
attacks and can potentially cause serious damage to 
users and their data. Second, I³FS is implemented 
inside the kernel as a loadable module. We believe that 
the file system provides the most well-suited hooks for 
security modules because it is one level above the per- 
sistent storage and most intrusions would cause file 
system activity. 


In addition to providing these advantages over 
Tripwire, our system is implemented as a stackable 
layer such that it can be stacked on top of any file sys- 
tem. For example, we can use stacking over NFS to 
provide a network-wide secure file system as well. 
Finally, it is easier to compromise user-level tools 
(like Tripwire) than to mount successful attacks 
at the kernel level. 


We used a stackable file system template gener- 
ated by FiST [28] to build an integrity checking layer 
which intercepts calls to the underlying file system. 
I³FS uses cryptographic checksums to check for 
integrity. It stores the security policies and the checksums 
in four different in-kernel Berkeley databases 
[8]. During setup, the administrator specifies detection 
policies in a specific format, which are loaded into the 
I³FS databases. File system specific calls trigger the 
integrity checker to compare the checksums for files 
that have an associated policy. Based on the results, 
the action is logged and access is allowed or denied 
for that file. Thus, our system design uses on-access, 
real-time intrusion detection to restrict the damage 
caused by an intrusion attack. 


Design 


Checksumming using hash functions is a com- 
mon way of ensuring data integrity. Recently, the use 
of cryptographic hash functions has become a standard 
in Internet applications and protocols. Cryptographic 
hash functions map strings of different lengths to short 
fixed size results. These functions are generally 
designed to be collision resistant, which means that 
finding two strings that have the same hash result is 
impractical. In addition to basic collision resistance, 
functions like MD5 [19] and SHA1 [4] also offer randomness, 
unpredictability of the output, etc. In I³FS, 
we use MD5 for computing checksums. 


We have designed I³FS as a stackable file system 
[26]. File system stacking is a technique to layer new 
functionality on top of existing file systems, as can be 
seen in Figure 1. With no modification to the lower 
level file system, a stackable file system operates 
between the virtual file system (VFS) and another file 
system. I³FS intercepts file system calls and normally 
passes them to the lower level file system; however, 
I³FS also injects its integrity checking operations and 
based on return values to system calls, it affects the 
behavior that user applications see. 


When designing I³FS, we aimed at offering a 
good balance between security and performance. We 
offer configurable options that allow administrators to 
tailor the features of I³FS to their site needs, trading 
off functionality for performance. 


Figure 1: I³FS Architecture. 








Threat Model 


I³FS is primarily aimed at detecting the following: 

• Malicious replacement of vital files such as the 
  ones in the /bin directory. Attackers could 
  replace programs such as ls and ps with Trojans, 
  without the knowledge of the system 
  administrators. These kinds of attacks can be 
  tracked and prevented through I³FS by setting 
  up appropriate policies for important files. 
• Unauthorized modification of data by an eavesdropper 
  in the network, in the case of remote 
  file systems, where the client file system communicates 
  with the disk over an insecure network. 
• Corruption of disk data due to hardware errors. 
  Inexpensive disks such as IDE disks silently 
  corrupt data stored in them due to magnetic 
  interference or transient errors. These errors 
  cannot always be detected by normal file systems. 
  I³FS can notify the administrator about 
  disk corruption if there is a suitable policy associated 
  with the file. 


Policies 


The two main goals we considered when designing 
the policies for I³FS were versatility and ease of 
use. The policy syntax provided by I³FS is similar to 
that of the user-level Tripwire [10]. The general format of an 
I³FS policy is as follows: 

{-o|-e|-x} OBJECT -m FLAGS -p PROPERTIES 
-a ACTION [-g GRANULARITY] [-f FREQ] 


2004 LISA XVIII — November 14-19, 2004 — Atlanta, GA 




where
• -o OBJECT specifies the object (file or directory)
  for which the rule is valid. If the object is a
  directory, then the rule applies recursively to all
  the files and sub-directories. The -e option is used
  to exclude an object (in most cases sub-directories)
  from the integrity checks, and the -x option is used
  to remove a policy.
• -m FLAGS represents the set of attributes of the
  respective object, used to calculate the checksum.
  The supported attributes are as follows:
    Permission and file mode bits
    Inode number
    Number of links
    User id of the owner
    Group id of the owner
    File size
    ID of the device on which the inode resides
    Number of blocks allocated
    Access time
    Modification time
    Inode change time
• -p PROPERTIES represents the properties of the
  policies, used to calculate the checksum. The
  properties offered are as follows:
    D  Checksum file data
    I  Inherit the policies for new files
• -a ACTION determines the action taken if the
  integrity check fails. Our I3FS implementation
  supports only two actions: BLOCK and NO-BLOCK.
  The BLOCK action returns a failure for any attempt
  to access the respective file and alerts the
  administrator about the inconsistency of this
  critical resource. The NO-BLOCK action lets the
  operation go through I3FS to the underlying file
  system. All integrity check failures are logged in
  the I3FS system.
• -g GRANULARITY specifies whether the checksumming
  is done on a per page basis or for the entire file
  at once. The available granularity options are
  PER_PAGE or WHOLE_FILE. PER_PAGE is useful for
  mostly-random file access patterns, and WHOLE_FILE
  is useful for mostly sequential small-file access
  patterns.
• -f FREQ is an integer value that determines the
  frequency of integrity checks. For example, a value
  of 50 for frequency would make I3FS perform
  integrity checking for the file every 50 times it is
  opened. This option is available only if WHOLE_FILE
  checksumming is chosen.
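
As an illustration of the policy format above, the following is a minimal sketch of a parser for one policy line. It is not the I3FS setup tool: the Policy class and parse_policy function are hypothetical names, the FLAGS and PROPERTIES values are kept as opaque strings, and treating every option after the object as optional is an assumption made so that -e and -x lines can be parsed too.

```python
import shlex
from dataclasses import dataclass
from typing import Optional

# Maps each option letter from the policy format to a field name (names are ours).
OPTION_FIELDS = {"-m": "flags", "-p": "properties", "-a": "action",
                 "-g": "granularity", "-f": "frequency"}


@dataclass
class Policy:
    mode: str                       # "-o" (add), "-e" (exclude), or "-x" (remove)
    obj: str                        # file or directory the rule applies to
    flags: Optional[str] = None     # opaque attribute selection
    properties: Optional[str] = None
    action: Optional[str] = None
    granularity: Optional[str] = None
    frequency: Optional[int] = None


def parse_policy(line: str) -> Policy:
    """Parse one policy line into a Policy, raising ValueError on bad input."""
    tokens = shlex.split(line)
    if len(tokens) < 2 or tokens[0] not in ("-o", "-e", "-x"):
        raise ValueError("expected -o, -e, or -x followed by an object")
    policy = Policy(mode=tokens[0], obj=tokens[1])
    rest = tokens[2:]
    if len(rest) % 2:
        raise ValueError("every option needs a value")
    for opt, value in zip(rest[::2], rest[1::2]):
        if opt not in OPTION_FIELDS:
            raise ValueError(f"unknown option {opt}")
        setattr(policy, OPTION_FIELDS[opt], int(value) if opt == "-f" else value)
    if policy.action not in (None, "BLOCK", "NO-BLOCK"):
        raise ValueError("ACTION must be BLOCK or NO-BLOCK")
    if policy.granularity not in (None, "PER_PAGE", "WHOLE_FILE"):
        raise ValueError("GRANULARITY must be PER_PAGE or WHOLE_FILE")
    # The text states that -f is available only with WHOLE_FILE checksumming.
    if policy.frequency is not None and policy.granularity != "WHOLE_FILE":
        raise ValueError("-f requires -g WHOLE_FILE")
    return policy
```

For example, parse_policy("-o /usr/bin -m FLAGS -p D -a BLOCK -g WHOLE_FILE -f 50") yields a policy with a frequency of 50, while the same line with -g PER_PAGE is rejected.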




We have chosen the set of policy options such 
that it helps detect most kinds of attacks on the file 
system. Checksumming different fields of the meta 
data of files helps detect whether important files have 
been re-written by malicious programs through the file 
system. Checksumming file data helps detect unauthorized
modification of data possibly made without the
knowledge of the file system. An example of this is a
malicious process that can write to the raw disk device 
directly in Unix-like operating systems. 
I3FS Databases

I3FS configuration data is stored in four different
in-kernel databases. KBDB [8] is an in-kernel
implementation of the Berkeley DB [21]. Berkeley DB is a
scalable, high performance, transaction-protected data
management system that efficiently and persistently
stores (key, value) pairs using hash tables, B+ trees, or
queues. I3FS stores the four databases in the B+ tree
format, so that we benefit from locality. The schema for
the four databases is given in Figure 2.


Figure 2: I3FS: Database schemas.







Having separate databases for storing the data and 
meta data checksums is advantageous in certain situa- 
tions. Generally, we expect that meta data checksum- 
ming would be used more commonly than data check- 
summing for two reasons. First, almost all modifica- 
tions to a file made through the file system will result 
in modifications to its meta data. Second, meta data 
checksumming is less time-consuming than data check- 
summing as the number of bytes to be checksummed is 
smaller. Therefore, having the data and meta data 
checksums in two different databases results in less I/O 
and more efficient cache utilization as data checksums 
need not be fetched along with meta data checksums. 


The policy database (policydb) contains the pol- 
icy options associated with the files and optionally the 
frequency of check values. We use the inode number 
to refer to the policies instead of the path names so as 
to avoid unnecessary string comparisons. We have a 
user level tool that reads the policy file and populates 
the policy database. Further details about initialization 
and setup are given later. The policy database has the 
inode number of the file as the key, and the data is 
either a 4 byte or an 8 byte value containing the policy 
bits and optionally the frequency of integrity checks 
(if the frequency of checks policy option is chosen). 
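
The 4 byte versus 8 byte policy value can be pictured with the sketch below. The byte order, the field order, and the function names are assumptions for illustration; the text only states that the value holds the policy bits and, optionally, the frequency of integrity checks.

```python
import struct


def pack_policy(policy_bits, frequency=None):
    """Encode a policydb value: 4 bytes of policy bits, plus 4 optional bytes."""
    if frequency is None:
        return struct.pack("<I", policy_bits)
    return struct.pack("<II", policy_bits, frequency)


def unpack_policy(value):
    """Decode a policydb value into (policy_bits, frequency or None)."""
    if len(value) == 4:
        (policy_bits,) = struct.unpack("<I", value)
        return policy_bits, None
    policy_bits, frequency = struct.unpack("<II", value)
    return policy_bits, frequency


# The policy database maps an inode number to the packed value.
policydb = {1234: pack_policy(0x5), 5678: pack_policy(0x5, frequency=50)}
```

Keeping the frequency out of the value when it is not used is what lets most entries stay at 4 bytes.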


The data checksum database (datadb) contains the
checksums of file data for those files that have a policy
option for checksumming their data. Since there are
two sub-options for checksumming file data, the per
page and the whole file checksumming, this database
contains either N checksums for a single file, where
N is the number of pages in the file, or a single
checksum for the entire file. The inode number and the
page number form the key for this database. If the option
is to checksum the whole file, then the single checksum
value will be indexed with page number zero. The data
checksum database is populated during the I3FS
initialization phase through an ioctl, when the policies
are added to the policy database.
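
A minimal sketch of the (inode number, page number) keying described above follows. The function name is hypothetical, a Python dict stands in for the in-kernel B+ tree, MD5 is used only as a stand-in checksum, and the 4 KB page size and the numbering of pages from zero are assumptions.

```python
import hashlib

PAGE_SIZE = 4096  # assumed page size


def datadb_entries(inode, data, per_page):
    """Return {(inode, page_number): checksum} for one file's contents."""
    if not per_page:
        # Whole file checksumming: a single entry indexed with page number zero.
        return {(inode, 0): hashlib.md5(data).digest()}
    # Per page checksumming: one entry for every page of the file.
    return {
        (inode, page): hashlib.md5(data[offset:offset + PAGE_SIZE]).digest()
        for page, offset in enumerate(range(0, len(data), PAGE_SIZE))
    }
```

A 10,000 byte file therefore produces three entries under per page checksumming and one entry under whole file checksumming.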


The meta data checksum database (metadb) has a 
simple design. The key is the inode number and the 
data is the checksum value for the set of fields of the 
inode that are specified in the policy options. Informa- 
tion about the set of fields that are checksummed is 
not stored in the meta data checksum database. Instead 
it is retrieved from the policy database that stores the 
policy bits. This database is also populated during the 
initialization phase which we discuss later. 


The access counter database (accessdb) contains 
a counter that represents the number of times a file has 
been opened after the last time it was checked for 
integrity. This is useful to set custom numbers for fre- 
quency of checks, so that less important files need not 
be checked for integrity every time they are accessed. 
For files that have a policy indicating a custom fre- 
quency of check number, every read will result in get- 
ting the previous counter value, increasing it by one 
and saving the new value, if the counter has not 
exceeded the frequency limit. 
Caching in I3FS

In I3FS, each file access is preceded by a check
whether that file has an associated policy or not. We
expect that the number of files that have policies will
be much less than the number of files without policies.
Hence, it is important that we optimize for the common
case of a file without an associated policy. Second,
for those files that have policies and are accessed
frequently, checksums should not be re-computed on
each access.


I3FS caches two kinds of information. First, 
whether a given file has a policy associated with it or 
not, and if so, the policy for that file. Second, it caches 
the result of the previous integrity check. All informa- 
tion is cached in the private data of the in-memory 
inode objects. The inode private data includes several 
new fields to cache policies, meta data checksum 
results, whole file data checksum results, and per page 
checksum results. While caching the policies, the 
result of the check for the existence of a policy is also 
cached. This mechanism serves the purpose of having 
both a positive and negative cache for the existence of 
policies, thereby expediting the check for both files 
that have and those that do not have policies associ- 
ated with them. 


As a per page integrity checking cache, the inode 
private data contains an integer array with ten ele- 
ments which acts as a page bitmap. The cache can 
hold the integrity check results for the first 320 pages 
when run on an i386 system with page size of 4KB. 
Thus the results for files which are less than 5 MB can 
be fully cached. This accounts for almost 90% of the 
files in a normal system [7].


The data and meta data integrity check result
caches are invalidated every time there is a data or
inode write for that file. Since all information is
cached in the inode private data and not in an external
list, the reclamation process for the inode cache will
take care of the I3FS configuration cache reclamation
also. This method of caching is advantageous because
the inodes for the frequently-accessed files will be
present in the inode cache and hence the policies and
results for those files will also be present in the cache.
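
The per page result cache described in this section can be sketched as a bitmap of ten 32-bit integers. The class and method names are hypothetical, and the meaning of a set bit (the page passed its check since the last invalidation) is an assumption; the text states only the array size and the 320-page capacity.

```python
BITS_PER_INT = 32
BITMAP_INTS = 10
MAX_CACHED_PAGES = BITS_PER_INT * BITMAP_INTS  # 320 pages


class PageResultCache:
    """Bitmap cache of per page integrity check results for one inode."""

    def __init__(self):
        self.bitmap = [0] * BITMAP_INTS

    def mark_verified(self, page):
        """Record a successful check; pages beyond the bitmap are not cached."""
        if not 0 <= page < MAX_CACHED_PAGES:
            return False
        self.bitmap[page // BITS_PER_INT] |= 1 << (page % BITS_PER_INT)
        return True

    def is_verified(self, page):
        """Return True if the page has a cached successful check."""
        if not 0 <= page < MAX_CACHED_PAGES:
            return False
        return bool((self.bitmap[page // BITS_PER_INT] >> (page % BITS_PER_INT)) & 1)

    def invalidate(self):
        """Drop all cached results, as on a data or inode write."""
        self.bitmap = [0] * BITMAP_INTS
```

Because the bitmap lives with the inode, it disappears when the inode is reclaimed, which matches the reclamation behavior described above.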


Securing I3FS Components


Securing the databases that store the configuration
and setup of I3FS is one of the prime requirements
for making I3FS a secure file system. There can be
valid updates to the checksums needed when a file
needs to be genuinely updated and these updates have
to follow a secure channel so that there can be a clear
differentiation between authorized and unauthorized
updates to the files. I3FS uses an authentication
mechanism to ensure that updates to the checksums are
made by authorized personnel.


I3FS stores the four databases that it uses in an
encrypted form. We use the in-built cryptographic API
provided by the Berkeley Database Manager for
encryption. We use the AES encryption algorithm [14] with
a 128-bit key size. Since I3FS requires a key to be
provided for reading the encrypted database, we wrote a
custom file system mount program that accepts the
passphrase from the administrator. Having the database
encrypted prevents unauthorized reading of the database
file without going through the authentication process.


Authentication 


An authentication mechanism is required for
I3FS for two reasons. First, mounting and setup of
I3FS should be done through a secure channel so that
malicious processes that acquire super user privileges
cannot mount the file system with incorrect
configuration options. Second, valid updates to the files
that carry policies should be permitted only through a
secure channel. This is because critical programs and
files need to be updated occasionally and such updates
should not require reinitialization of the file system.


Since the I3FS databases are encrypted, reading them
requires a passphrase. Therefore we provide a custom
mount program that authenticates the person mounting
the file system. The first time I3FS is mounted, the
administrator is prompted for a passphrase. This
passphrase is used to compute the cryptographic hash
for a known word, “i3fspassphrase.” We store this
hash as part of the policy database. During subsequent
mounts, the passphrase entered is validated by
computing the hash again and comparing it with the stored
hash. Upon mismatch of hashes, the mount process is
aborted and an error message is returned to the
user-level mount program. If the passphrase entered is
correct, it is stored in the private data of the
super-block structure. Thus the passphrase is kept
non-persistent and stored in the kernel memory only.


The checksums for all files that have a policy are
computed during the initialization phase of I3FS. To
allow valid updates to files whose checksums have
already been stored, we provide two modes of operation
for I3FS: one that allows updates and another
that does not. This is implemented using a flag in the
in-memory super block which can be set and unset
from user level through an ioctl. This ioctl can be
executed only after providing a valid passphrase. The
passphrase passed to the ioctl is compared with the
one that is stored in the super block private data and
access is granted based on the result. A similar
authentication method is implemented for the ADD_POLICY
and REMOVE_POLICY ioctls as well.
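
The mount-time passphrase check described above can be sketched as a keyed hash of the known word. This is a user-level illustration only: the function names are hypothetical, and HMAC with SHA-256 is an assumption, since the text says only that a cryptographic hash of the known word is computed using the passphrase.

```python
import hashlib
import hmac

KNOWN_WORD = b"i3fspassphrase"


def passphrase_tag(passphrase):
    """Hash the known word, keyed with the passphrase (stored on first mount)."""
    return hmac.new(passphrase.encode("utf-8"), KNOWN_WORD, hashlib.sha256).digest()


def verify_mount(passphrase, stored_tag):
    """Recompute the hash on a later mount and compare it with the stored one."""
    return hmac.compare_digest(passphrase_tag(passphrase), stored_tag)


# First mount: the hash is stored; the passphrase itself is never written out.
stored = passphrase_tag("example passphrase")
```

Storing only the keyed hash means the database never contains the passphrase itself, which is consistent with keeping the passphrase in kernel memory only.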


Actions Upon Failure 


There are two kinds of actions that can be specified
for files for which integrity checking fails. They
are the BLOCK and NO-BLOCK options. The BLOCK
option disallows access to files that fail the integrity
check, and a message is recorded in the log. In the
case of the NO-BLOCK option, access is allowed for files
that fail the integrity check but an appropriate message
is logged. By default I3FS logs messages through syslog.
Optionally, a log file name can be given as a mount
option to the custom mount program, and all log
messages will be written directly to that file.


Implementation 


I3FS is implemented as a stackable file system
that can be mounted on top of any other file system.
Unlike traditional disk-based file systems, I3FS is
mounted over a directory, where it stores the files. In
this section we discuss the key operations of I3FS and
their implementation.


Initialization and Setup 


The first time the administrator mounts I3FS, a
passphrase needs to be entered for the file system to
initialize itself. The first mount operation will store the
HMAC hash of a known word, “i3fspassphrase,”
hashed using the passphrase entered, into the policy
database for authenticating further mounts. After the
file system is mounted, the administrator has to run a
user level setup tool that takes a policy file as input.
The format of the policy file is described in the section
on design policies. The user level utility calls the
corresponding ioctls to set up the policies in the four
databases:

• ADD_POLICY: This ioctl takes the passphrase,
  path name, and policy bits (an integer) as input.
  It verifies the passphrase, converts the path
  name to an inode number, and stores the inode
  number and the policy bits in the policy database.
  In addition, based on the policy bits, the
  ioctl computes meta data and data checksums
  appropriately and inserts them into the meta
  data and data databases.

• REMOVE_POLICY: This ioctl takes a passphrase
  and path name as input. It authenticates and
  then converts the path name to an inode number
  and removes all entries from all four databases
  that match the given inode number.

• ALLOW_UPDATES: This ioctl takes the passphrase
  as argument. It just authenticates and sets the
  AUTO_UPDATE flag in the in-memory super block
  which allows updates to files with a policy, along
  with the update of their checksum values.

• DISALLOW_UPDATES: This ioctl resets the
  update flag in the super block private data, so
  that further updates to checksums of files with
  policies are stopped.


Recursive policies can also be specified for 
directories, so that the policy is applied to all files 
inside the directory tree. In this case, the user level 
program uses nftw(3) to enumerate the set of files for 
which the policies should be applied. It then calls the 
ioctl for each of the files. 
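
The recursive enumeration can be pictured with the user-level sketch below. Python's os.walk stands in for nftw(3), and the add_policy callback is a hypothetical stand-in for issuing the ADD_POLICY ioctl on one path; neither is the actual setup tool.

```python
import os


def paths_under(root):
    """Yield the directory itself and every file and sub-directory below it."""
    yield root
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            yield os.path.join(dirpath, name)


def apply_recursive(root, add_policy):
    """Call add_policy once per enumerated path and return how many were visited."""
    count = 0
    for path in paths_under(root):
        add_policy(path)  # stand-in for one ADD_POLICY ioctl call
        count += 1
    return count
```

Doing the walk at user level keeps the ioctl interface simple: the kernel side only ever handles one path at a time.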


The usage of the user level program i3fsconfig is 
as follows: 


i3fsconfig [-u ALLOW|DISALLOW] 
[-f POLICYFILE] 


Mount Options 


I3FS is mounted using a custom mount program
that uses the mount(2) system call. It uses getpass(3) to
accept a passphrase from the administrator and passes
the passphrase as a mount option. A custom mount
program is used instead of the Linux mount program
because the passphrase entered should not be visible at
the user level after the file system is mounted.


There are three optional mount parameters.

• The auto-update option sets the AUTO_UPDATE
  flag in the kernel so that checksums will be
  updated every time a file with a policy is
  updated.

• The logfile option allows one to specify a separate
  log file where the I3FS log messages can be
  written to.

• The dbdir option allows the administrator to set
  the location of the checksum databases. Normally
  these databases are stored inside the file
  system, and I3FS prevents direct access to them
  and hides them from view. With this option,
  administrators can place the databases in a
  different directory than the checksummed file
  system; this is useful, for example, when I3FS is
  stacked on top of NFS because the databases
  could be kept on a safer local directory.


Meta Data Integrity Checking 


The flowchart for meta data integrity checking is
shown in Figure 3. For checksumming the meta data,
since we have a customizable set of inode fields to be
checksummed, we need to use both the policy bits and
the stored checksums for integrity checking. The meta
data integrity checking is done in the file permission
check function, i3fs_permission. The i3fs_permission
function is called after lookup for every file that is
accessed. Hence, the integrity check cannot be bypassed
for files with a policy. The permission check function
first checks the policy cache to determine if the file’s
inode has a policy associated with it. Upon a cache
miss, it refers to the policy database and then decides
the result. If there is a policy associated with the file,
its policy bits are retrieved from the policy database.
From the policy bits, the set of inode fields that have
to be checksummed is found and the checksum for
those fields is computed. This computed checksum is
verified against the checksum value stored in the meta
data checksum database. If both checksums match,
then access is granted; if the checksums do not match,
then the necessary action is performed as per the
action policy bit.


[Flowchart steps: get policy, cache policy, compute
checksum, get stored checksum from the checksum
databases, compare checksums, cache result]

Figure 3: Flowchart for I3FS permission checks.


Once it is determined if the inode has a policy 
associated with it or not, the information is stored in 
the in-memory inode as a cache for further accesses. 
As long as that inode is present in the inode cache, the 
policy information will also be cached. 
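
The permission check flow described in this section can be sketched at user level as follows. This is an illustration, not the i3fs_permission code: the bit assignments, the function names, and the use of MD5 are all assumptions, and Python dicts stand in for the policy database, the meta data checksum database, and the in-memory inode cache.

```python
import hashlib
import os

# Hypothetical policy bit assignments (the real layout is not given in the text).
FIELD_BITS = {0x01: "st_mode", 0x02: "st_uid", 0x04: "st_gid", 0x08: "st_size"}
ACTION_BLOCK = 0x100  # set: BLOCK, clear: NO-BLOCK


def metadata_checksum(st, policy_bits):
    """Checksum only the stat fields selected by the policy bits."""
    digest = hashlib.md5()
    for bit in sorted(FIELD_BITS):
        if policy_bits & bit:
            field = FIELD_BITS[bit]
            digest.update(f"{field}={getattr(st, field)};".encode())
    return digest.digest()


def permission_check(path, policydb, metadb, policy_cache, log):
    """Return True if access is granted, False if it is blocked."""
    st = os.stat(path)
    inode = st.st_ino
    if inode not in policy_cache:
        # Cache miss: consult the policy database. Storing None gives a
        # negative cache entry for files without a policy.
        policy_cache[inode] = policydb.get(inode)
    policy_bits = policy_cache[inode]
    if policy_bits is None:
        return True
    if metadata_checksum(st, policy_bits) == metadb.get(inode):
        return True
    log.append(f"meta data integrity check failed: {path}")
    return not (policy_bits & ACTION_BLOCK)
```

Caching None for files without a policy is what makes the common case cheap: after the first lookup, such files never touch the policy database again while their inode stays cached.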


Data Integrity Checking 


We provide two options for checksumming file
data. The first is PER_PAGE checksumming and the
second is WHOLE_FILE checksumming. In the case of
per page checksumming, integrity checking is done in
the page level read function, i3fs_readpage. If the
option is to checksum whole files, then the integrity
checking is done in the open function, i3fs_open. In
the case of WHOLE_FILE checksumming, whenever
i3fs_open is called for a file with a policy, the policy
bits present in the in-memory inode are checked. If the
data checksum mode bit is set to WHOLE_FILE, then
the checksum for the whole file is computed and
verified against the checksum value stored in the data
checksum database. If they match, i3fs_open
succeeds; if not, the necessary action is performed as per
the policy bits. The WHOLE_FILE integrity check
results are cached in the inode private data in a field
named whole_file_result.


In PER_PAGE integrity checking, during
i3fs_readpage, I3FS checks the policy bits to determine
whether page-level checksumming is enabled. If so,
the checksum is computed for that page and
compared with the stored value in the data checksum
database. The result is cached in the page bitmap
present in the inode private data, as explained earlier.


Frequency of Checks 


Since whole file checksumming is a costly
operation, we provide an option for specifying the
frequency of integrity checks in the policy. For
performance reasons, one can set up a policy for a file such
that it will be checked for integrity every N times it is
opened, where N is an integer value. Every time a file
with a policy is opened, we check if it has a frequency
number associated with it. If yes, the counter entry for
the file in the access database is incremented by one.
When the value is equal to N, an integrity check is
performed and the counter is then reset to zero.
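
The counter logic above can be sketched as follows. The function names are hypothetical, a dict stands in for the access counter database, and the verify callback stands in for the whole file integrity check.

```python
def open_with_frequency(inode, freq, accessdb, verify):
    """Run verify() on every freq-th open; other opens skip the check."""
    count = accessdb.get(inode, 0) + 1
    if count >= freq:
        accessdb[inode] = 0          # reset the counter after a check
        return verify()
    accessdb[inode] = count          # remember the open, skip the check
    return True


def checks_performed(opens, freq):
    """Count how many integrity checks a run of opens triggers."""
    accessdb, performed = {}, []
    for _ in range(opens):
        open_with_frequency(1, freq, accessdb, lambda: performed.append(1) or True)
    return len(performed)
```

With 500 opens, a frequency value of 1 triggers 500 checks, while a value of 32 triggers only 15, since the check runs once per 32 opens.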

Updating Policies 

Policies that are enforced when the file system is
initialized might not remain valid at all times. We
provide a method by which the administrator can update
the policies dynamically without reinitializing the
system. This can be done using the following two ioctls:
ADD_POLICY and REMOVE_POLICY. The administrator
can either add policies to new files or remove policies
from existing files. If a policy for an existing file has to
be modified, it has to be first removed and then
re-inserted using the ADD_POLICY ioctl.


Updating Checksums 


Often it is required that files with policies be
updated from time to time. For example, administrators
need to install or upgrade system binaries. Such
updates should also re-compute the checksums that are
stored in the databases, so that I3FS need not be
reinitialized for every file update. However, these kinds
of checksum updates should be allowed through a secure
channel so as to prevent malicious programs from
triggering checksum updates subsequent to an
unauthorized modification to file data. In I3FS, we provide a
flag called AUTO_UPDATE which can be set and reset
by the administrator after authenticating using the
passphrase. This can also be set during mount as a
mount option. When the AUTO_UPDATE flag is set, all
updates to files with policies will update the
checksums associated with them. If the flag is not set, file
data updates will be allowed without updates to the
checksums so that these are categorized as
unauthorized changes. The AUTO_UPDATE flag can only be set
from a console for security reasons; processes that are
executing in a non-console shell are not allowed to
update checksums when the AUTO_UPDATE flag is set.


The checksum updates for meta data are done in
the put_inode file system method of I3FS. Whole file
checksums are updated in the file release method and
per page checksums are updated in the writepage and
commit_write methods, respectively.


Inheriting Policies 


To facilitate automatic policy generation for new
files that get created after I3FS is initialized, we
provide a method for the policy of the parent directory to
be inherited by the files and directories that are created
under it. This can be used by setting the INHERIT
policy bit for the directory in question. Whenever a file or
a directory is created, the policy of the parent
directory is copied for it, if its parent directory has the
INHERIT bit set. However, for checksums to be updated
for the new file, the AUTO_UPDATE flag must be set. If
the flag is not set, then the policy of the parent
directory will be copied for the new file, but the checksums
will not be updated. Thus the next time such a file is
accessed there will be a checksum mismatch.
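
The interaction between the INHERIT bit and the AUTO_UPDATE flag can be sketched as below. The bit value and function names are hypothetical, and dicts stand in for the policy and checksum databases.

```python
INHERIT = 0x200  # hypothetical value for the INHERIT policy bit


def on_create(parent_ino, child_ino, child_checksum, policydb, metadb, auto_update):
    """Copy the parent's policy to a new file if the parent has INHERIT set."""
    parent_bits = policydb.get(parent_ino)
    if parent_bits is None or not parent_bits & INHERIT:
        return False
    policydb[child_ino] = parent_bits
    if auto_update:
        # Checksums are stored only while the AUTO_UPDATE flag is set.
        metadb[child_ino] = child_checksum
    return True


def check(ino, current_checksum, policydb, metadb):
    """Return True if the file has no policy or its checksum matches."""
    if ino not in policydb:
        return True
    return metadb.get(ino) == current_checksum
```

With auto_update false, the new file gets a policy but no stored checksum, so the first later check reports a mismatch, as the paragraph above describes.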


Evaluation 


We used the stackable templates generated by
FiST [28] as our base, and it started with 5,670 lines of
code. To implement I3FS, we added 4,227 lines of
kernel code and 300 lines of user level code. In addition to
this, I3FS includes 367 lines of checksumming code
implemented by Aladdin Enterprises [5]. We wrote two
user level tools: a custom mount program for I3FS and
another tool for setting up, initializing, and configuring
I3FS. I3FS is implemented as a kernel module and
requires the in-kernel Berkeley database [8] module to
be loaded prior to using I3FS.


To measure the performance of I3FS, we stacked
I3FS on top of a plain Ext2 file system and compared
its performance with native Ext2. All measurements
were conducted on a 1.7 GHz Pentium 4 with 1 GB
RAM and a 20 GB Western Digital Caviar IDE disk.
For the frequency of checks experiment, we lowered
the amount of memory to 64 MB. The operating system
we used was Red Hat Linux 9 running a vanilla 2.4.24
kernel. We unmounted and remounted the file systems
before each run of benchmarks so as to ensure cold
caches. All benchmarks were run at least ten times and
we computed 95% confidence intervals for the mean
elapsed, system, user, and wait time using the Student-t
distribution. In each case, the half widths of the
intervals were less than 5% of the mean. In the graphs in
this section, we show the 95% confidence interval as an
error bar for the elapsed time. Wait time is the elapsed
time less CPU time and user time and consists mostly
of I/O, but process scheduling can also affect it.
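
As a worked illustration of the confidence interval computation, the sketch below derives the 95% half width for ten runs using the Student-t critical value for nine degrees of freedom. The elapsed times are hypothetical values chosen for the example, not measurements from our benchmarks, and the function name is ours.

```python
import math
import statistics

T_975_DF9 = 2.262  # two-sided 95% critical value, Student-t, 9 degrees of freedom


def ci95_half_width(samples):
    """Half width of the 95% confidence interval for the mean of ten runs."""
    if len(samples) != 10:
        raise ValueError("this sketch hard-codes the critical value for ten runs")
    return T_975_DF9 * statistics.stdev(samples) / math.sqrt(len(samples))


# Hypothetical elapsed times in seconds for ten runs of one benchmark.
elapsed = [201.3, 199.8, 202.1, 200.4, 198.9, 201.0, 200.2, 199.5, 201.7, 200.6]
mean = statistics.mean(elapsed)
half_width = ci95_half_width(elapsed)
relative_half_width = half_width / mean  # compared against the 5% threshold
```

A run set passes the criterion stated above when relative_half_width is below 0.05; with more than ten runs the critical value would need to be looked up for the matching degrees of freedom.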
We calculated the overheads of I3FS under several
different configurations and a variety of system
workloads. Based on the types of policies, we
classified the tests as follows:

• Without any policies (NP)
• Only meta data checksumming (MD)
• Meta data and whole-file data checksumming (MW)
• Meta data and per-page data checksumming (MP)
• Meta data and whole-file checksumming with
  inheritable policies (MWI)
• Meta data and per-page checksumming with
  inheritable policies (MPI)


Each of the above configurations of I3FS is used
to identify the isolated overheads of the components of
I3FS. The NP configuration does not compute
checksums, and is useful in finding the overheads due to the
stackable layer and to the check for whether files have
policies associated with them or not. The MD configuration
is used to find the overhead of checksumming the meta
data alone. The other configurations, MW, MP, and MPI,
isolate the overheads associated with each of the
checksumming options described in the design section.


We tested I3FS using a CPU-intensive benchmark,
an I/O-intensive benchmark, and a custom read
benchmark to test the frequency of checks performance.


For a CPU-intensive benchmark, we compiled 
the Am-utils package [16]. We used Am-utils 6.1b3: it 
contains over 60,000 lines of C code in 430 files. The 
build process begins by running several hundred small 
configuration tests to detect system features. Then it 
builds a shared library, ten binaries, four scripts, and 
documentation: a total of 152 new files and 19 new 
directories. Although the Am-utils compile is CPU 
intensive, it contains a fair mix of file system opera- 
tions, which result in the creation of several files and 
random read and write operations on them. This
compile benchmark was done for Ext2, as well as for I3FS
for the aforementioned six configurations.


For an I/O-intensive benchmark we used
Postmark [23], a popular file system benchmarking tool.
Postmark creates a large number of files and
continuously performs operations that change the contents of
the files to simulate a large mail server workload. We
configured Postmark to create 20,000 files (between
512 bytes and 10 KB) and perform 200,000
transactions in 200 directories. Postmark was run on Ext2 and
I3FS with NP, MDI, MWI, and MPI configurations. The
other configurations, MD, MW, and MP, are not relevant
for Postmark, as Postmark creates a lot of new files
and these configurations only apply to existing files.


Finally, to measure the performance of I3FS with
frequency of checks, we wrote a custom program that
repeatedly performs read operations on a single file.
We conducted this test for I3FS with frequency of
checks set to 1, 2, 4, 8, 16, and 32.

Am-utils Results


Figure 4 shows the overheads of I3FS under
different configurations for an Am-utils compile.


[Bar chart: elapsed time (sec) for Ext2, I3FS-NP,
I3FS-MD, I3FS-MW, I3FS-MP, I3FS-MDI, I3FS-MPI, and
I3FS-MWI]

Figure 4: Am-utils results for Ext2 and I3FS.

Figure 5: Postmark results for Ext2 and I3FS.


The configuration of I3FS that has the maximum
overhead when compared to regular Ext2 is the MWI
configuration. It has an overhead of 4% elapsed time
and 13% system time. Most of the elapsed time
overhead is due to the system time increase caused by the
checksum computation. The MWI configuration
calculates data checksums for whole files including the files
that are newly created. Therefore, it has the highest
overhead of all configurations. The elapsed time
overheads of all other configurations are less than 1%. The
MPI configuration has a system time overhead of 7% as
this configuration computes data checksums for files
including newly created ones. The system time
overhead of the other configurations ranges from 2% to 3%.


Since an Am-utils compile represents a normal
user workload, we conclude that I3FS performs
reasonably well under normal conditions.

Postmark Results 

Figure 5 shows the overheads of Ext2 and I3FS
for Postmark under the NP, MDI, MWI, and MPI
configurations. Since Postmark creates and accesses files on
its own, it can only exercise configurations that have
inheritable policies.


Unlike the Am-utils compile, for Postmark we
were able to see a wide range of overheads for different
configurations of I3FS. The NP configuration had an
elapsed time overhead of 9%. The system time
overhead was 89%, which is mainly because of the check
for the existence of policies. The overhead due to
indirection of the stackable layer also adds to this overhead.
The MDI configuration had an elapsed time overhead of
67%. This overhead is partly because of checksum
computation for the meta data during file creation and
accessing. Database operations for storing and
retrieving meta data checksums also contribute to the overall
overhead. I3FS under the MWI configuration was 3.5
times slower than Ext2. This is because it computes,
stores, and retrieves checksums for the data and meta
data of all files, including newly created files. Finally,
the MPI configuration was 4.5 times slower than Ext2.
The MPI configuration checksums the meta data and the
individual pages of all the files. The MPI configuration
of I3FS is slower than the MWI configuration because
we configured Postmark to create files whose sizes
range from 512 bytes to 10 KB. Thus the maximum
number of pages a file can have is three as the page size
is 4 KB, and computing the checksums for the three
pages in one shot is more efficient than checksumming
individual pages separately.

Since Postmark creates 20,000 files and performs
200,000 transactions within a short period of just 10
minutes, it generates a rather intensive I/O workload.
In normal multi-user systems, such workloads are
unlikely. The above benchmark shows the worst case
performance of I3FS. Under normal conditions, the
overheads of I3FS are reasonable, as evident
from the Am-utils compile results shown earlier.


Frequency of Checks 


To measure the performance of I3FS for whole
file checksumming with frequency of checks enabled,
we wrote a custom user level program that reads the
first page of a 64 MB file 500 times. We ran this test
with 64 MB RAM, so as to ensure that cached pages
are flushed when the file is read sequentially during
the checksum computation phase. Since the file is read
sequentially, by the time the last page of the 64 MB
file is read, we can be sure that the first page is flushed
out of memory. We calculated the difference in speeds
for frequency values of 1, 2, 4, 8, 16, and 32. These
values reduce the frequency with which the checksums
are computed logarithmically. Figure 6 shows the results
of our custom benchmark for the different values of
frequency of checks.


As evident from the figure, the time taken is
reduced logarithmically as the frequency number
increases exponentially. We can see that the rate of
decrease of the elapsed time and the system time is
almost equal. This is because both the I/O for reading
the files and the checksum computation itself reduce
as the frequency value increases. Without a custom
frequency value (Freq-1), the program takes 1,393
seconds to complete, and as the frequency value
increases, the time taken reduces to 464 seconds, 279
seconds, and so on.


[Stacked bar chart: elapsed time (sec), split into wait,
user, and system time, for Freq-1, Freq-2, Freq-4,
Freq-8, Freq-16, and Freq-32]

Figure 6: Frequency-of-checks benchmark results for
frequency values of 1, 2, 4, 8, 16, and 32.


Therefore, when system administrators are 
concerned about the system performance while 
checksumming whole files, they can set an appro- 
priate frequency of check value. 


Related Work 


In recent years, systems researchers have pro- 
posed various alternatives to increase the security of 
computer systems. These solutions can be broadly cat- 
egorized as user mode and kernel mode. Integrity 
checking of files is an important aspect of system 
security. Our work is an in-kernel approach to check 
integrity and detect intrusions in the file system. 


In this section we briefly discuss some previous 
work that addresses integrity checking and file system 
security in a broader sense. We discuss this in three 
categories: user-mode utilities, in-kernel approaches 
and other approaches that increase the security of the 
computer systems. 


User Tools and Utilities The open-source com- 
munity has developed various user-mode tools for file 
system integrity checking. While we follow the seman- 
tics of Tripwire [10, 9, 22] for our integrity checker, 
there are other similar tools. These include Samhain 
[20], Osiris [15], AIDE [18], and Radmind [3]. Most of 
these user-mode tools were modeled along the lines of 
Tripwire. AIDE and Radmind have been developed for 
UNIX systems with some more functions like threaded 
daemons and easy system management. In addition, 
Samhain uses a stealth mode of intrusion detection with 
remote administration utilities. 


In-Kernel Approach Linux Intrusion Detection
System (LIDS) [12] is a more comprehensive system
that modifies the Linux access control semantics,
called discretionary access control (DAC), thus
offering mandatory access control inside the kernel. In
contrast, our work does not change any Linux semantics
or require any modifications to the kernel. We leverage
file system level call interposition using a loadable
kernel module for file systems.


A similar approach is used by Linux Security 
Modules (LSM) [24], which present an extensible 
framework with in-kernel hooks for adding new secu- 
rity mechanisms inside the kernel for file systems, 
memory management and the network sub-systems. 
LSM does not use any policies, but provides a founda- 
tion to add complete system security. 


Other Systems Security Apart from integrity 
checkers, there are other ways to increase the security 
of any host. System call interposition [17] uses the 
indirection of any call to a kernel function. This is a 
powerful tool for monitoring application behavior as 
soon as the context switches to kernel mode. Ostia [6] 
presents a model that delegates certain system critical 
responsibilities to a sandbox. This helps in localizing 
the impact of any attack after a pseudo off-line detec- 
tion process. In contrast, our approach uses interposi- 
tion of calls made by the virtual file system (VFS) on 
behalf of a file system. 


Another class of solutions uses call-graph analysis 
to backtrack any intrusions on a host [11]. These tech- 
niques aim to determine the vulnerability of the system 
used by the attacker to break into the system after an 
attack took place. In contrast, I3FS tries to detect an
intrusion or inconsistency in the system as it occurs. 


Finally, more recent work uses Virtual Machine 
Monitors (VMM) to detect any intrusions by placing the 
IDS in a more secure hardware domain [6]. This 
approach aims to minimize the impact of an attacker on 
the intrusion detection system. This approach has been 
tested for passive attack scenarios and incurs system 
overhead due to context switches across the interface 
between the OS and the VMM. In contrast, our approach 
has less overhead since we use fine-grained indirection. 


Conclusions 


We have described the design, operation, security, 
and performance of a versatile integrity checking file 
system. A number of different policy options are
provided with various levels of granularity. System
administrators can customize I3FS with the appropriate
options and policies so as to get the best use of it,
keeping in mind performance requirements. As evident
from the benchmark results, I3FS has a performance
overhead of 4% compared to regular Ext2 under normal
user workloads. The encrypted database and
cryptographic checksums make I3FS a highly secure and
reliable system.


Future Work 


Our group has previously developed secure and 
versatile file systems like NCryptfs [25, 27], Tracefs 
[1] and Versionfs [13] and we would like to integrate
the features of these file systems together with I3FS so
as to provide a highly secure and versatile system. 


Currently, I3FS cannot be customized to individual
users. We plan to add per-user policies and options,
so that individual users can set up security options for 
their own files, without requiring the intervention of 
the system administrators, but still allow administra- 
tors to override global policies. 


ABSTRACT


Existing systems for software deployment are neither safe nor sufficiently flexible. Primary 
safety issues are the inability to enforce reliable specification of component dependencies, and the 
lack of support for multiple versions or variants of a component. This renders deployment 
operations such as upgrading or deleting components dangerous and unpredictable. A deployment 
system must also be flexible (i.e., policy-free) enough to support both centralised and local 
package management, and to allow a variety of mechanisms for transferring components. In this 
paper we present Nix, a deployment system that addresses these issues through a simple technique 
of using cryptographic hashes to compute unique paths for component instances. 


Introduction 


Software deployment is the act of transferring 
software to the environment where it is to be used. 
This is a deceptively hard problem: a number of
requirements make effective software deployment dif- 
ficult in practice, as most current systems fail to be 
sufficiently safe and flexible. 


The main safety issue that a software deployment 
system must address is consistency: no deployment 
action should bring the set of installed software com- 
ponents into an inconsistent state. For instance, an 
installed component should never be able to refer to 
any component not present in the system; and upgrad- 
ing or removing components should not break other 
components or running programs [15], e.g., by over- 
writing the files of those components. In particular, it 
should be possible to have multiple versions and vari- 
ants of a component installed at the same time. No 
duplicate components should be installed: if two com- 
ponents have a shared dependency, that dependency 
should be stored exactly once. 
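
The first part of this requirement, that no installed component refers to a
component that is absent, amounts to checking that the installed set is
closed under the dependency relation. The following is a minimal sketch of
such a check; the component names and the references mapping are made up
for illustration.

```python
def dangling_references(installed, references):
    """Return {component: missing dependencies} for every installed
    component that refers to something that is not installed."""
    installed = set(installed)
    problems = {}
    for component in installed:
        missing = set(references.get(component, ())) - installed
        if missing:
            problems[component] = missing
    return problems


# Hypothetical dependency information.
references = {
    "subversion": {"openssl", "glibc"},
    "openssl": {"glibc"},
    "glibc": set(),
}

consistent = dangling_references({"subversion", "openssl", "glibc"}, references)
broken = dangling_references({"subversion", "glibc"}, references)
```

An empty result means the set of installed components is consistent in this
sense; removing openssl while Subversion still refers to it is reported.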


Deployment systems must be flexible. They 
should support both centralised and local package 
management: it should be possible for both site admin- 
istrators and local users to install applications, for 
instance, to be able to use different versions and vari- 
ants of components. Finally, it must not be difficult to 
support deployment both in source and binary form, or 
to define a variety of mechanisms for transferring 
components. In other words, a deployment system 
should provide flexible mechanisms, not rigid policies. 


Despite much research in this area, proper solu- 
tions have not yet been found. For instance, a sum- 
mary of twelve years of research in this field indicates, 
amongst others, that many existing tools ignore the 
problem of interference between components and that 
end-user customisation has only been slightly exam- 
ined [6]. Consequently, there are still many hard out- 
standing deployment problems (see the first section), 
and there seems to be no general deployment system
available that satisfies all the above requirements. 
Most existing tools only consider a small subset of 
these requirements and ignore the others. 


In this paper we present Nix, a safe and flexible 
deployment system providing mechanisms that can be 
used to define a great variety of deployment policies. 
The primary features of Nix are: 

• Concurrent installation of multiple versions and
  variants
• Atomic upgrades and downgrades
• Multiple user environments
• Safe dependencies
• Complete deployment
• Transparent binary deployment as an optimisation
  of source deployment
• Safe garbage collection
• Multi-level package management (i.e., different
  levels of centralised and local package management)
• Portability


These features follow from the fairly simple 
technique of using cryptographic hashes to compute 
unique paths for component instances. 


Motivation 


In this section we take a close look at the issues 
that a system for software deployment must be able to 
deal with. 


Dependencies For safe software deployment, it 
is essential that the dependencies of a component are 
correctly identified. For correct deployment of a com- 
ponent, it is necessary not only to install the compo- 
nent itself, but also all components which it may need. 
If the identification of dependencies is incomplete, 
then the component may or may not work, depending 
on whether the omitted dependencies are already 
present on the target system. In this case, deployment 
is said to be incomplete. 
As a running example for this paper we will use
the Subversion version management system (http:// 
subversion.tigris.org/). It has several (optional) depen- 
dencies, such as on the Berkeley DB database library. 
When we package the Subversion component for 
deployment, we must take this into account and ensure 
that the Berkeley DB component is also present on 
each target system. But it is easy to forget this! This is 
because such dependencies are often picked up 
“silently.” For instance, Subversion’s configure script
will detect and use Berkeley DB automatically if 
present on the build system. If it is present on the tar- 
get system, Subversion will happen to work; but if it is 
not, it won’t: an incomplete deployment. Some exist- 
ing deployment systems use various tricks to automate 
dependency identification, e.g., RPM [11] can use the 
ldd tool at packaging time to scan for shared library
dependencies. However, such approaches are either 
not general enough or not portable. 


Variability Components may exist in many variants.
Variants occur when different versions exist (i.e.,
almost always), and when a component has optional 
features that can be selected at build time. This is 
known as variability [24]. The Subversion component 
has several optional features, such as whether we want 
support for OpenSSL encryption and authentication, 
whether only a Subversion client should be built, and 
whether an Apache server module should be built so 
that Subversion can act as a WebDAV server. Of 
course, there also exist many different versions of 
Subversion, which we sometimes want to use in paral- 
lel (for instance, to test a new version before promot- 
ing it to production use on a server). A flexible 
deployment system should support the presence of 
multiple variants of a component on the same system. 
For instance, on a multi-user system different users 
may have different requirements and therefore need 
different variants; on a server system we may want to 
test a new component before upgrading critical server 
software to use it; or other components may have con- 
flicting requirements on some component. 


Consistency Unfortunately, most package man- 
agement disciplines do not support variants very well. 
Deployment operations (such as installing, upgrading, 
or renaming a component) are typically destructive: 
files are copied to certain locations within the file sys- 
tem, possibly overwriting what was already there. This 
can destroy the consistency among components: if we 
upgrade or delete some component, then another com- 
ponent that depends on it may cease to work properly. 
Also, it makes it hard to have multiple variants of a 
component installed concurrently, that is, different ver- 
sions of the component, or a version built with differ- 
ent parameters. For instance, the RPM packages for 
Subversion contain files such as /usr/bin/svn, making it 
impossible to have two versions installed at the same 
time. Worse, we might encounter unsatisfiable require- 
ments, e.g., if two applications both require mutually 
incompatible versions of some library. 
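
The /usr/bin/svn example can be made concrete with a small sketch that
reports which files would be claimed by more than one package. The package
contents and the /opt/<name> prefix are hypothetical and serve only to
contrast a shared installation prefix with one prefix per component.

```python
def file_conflicts(packages, prefix_for):
    """Map each absolute path to the packages installing it,
    keeping only paths claimed by more than one package."""
    owners = {}
    for name, files in packages.items():
        for rel in files:
            owners.setdefault(f"{prefix_for(name)}/{rel}", []).append(name)
    return {path: names for path, names in owners.items() if len(names) > 1}


# Two versions of the same component, each shipping bin/svn.
packages = {
    "subversion-0.32.1": ["bin/svn"],
    "subversion-0.34": ["bin/svn"],
}

shared = file_conflicts(packages, lambda name: "/usr")
isolated = file_conflicts(packages, lambda name: f"/opt/{name}")
```

With a shared prefix both versions claim /usr/bin/svn, so one would
overwrite the other; with a prefix per component there is no collision.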


Atomicity Component upgrades in conventional
systems are not atomic. That is, while a component is 
being overwritten with a newer version, the compo- 
nent is in an inconsistent state and may well not work 
correctly. This lack of atomicity extends beyond the 
level of individual components. When upgrading an 
entire system, for instance, it may be necessary to 
upgrade shared components such as shared libraries 
first. If they are not backwards compatible, then there 
will be a timing window in which components that use 
them fail to work properly. 


Identification Variants make identification of 
dependencies surprisingly hard. We may say that a 
component depends on glibc-2.3.2, but what are the 
exact semantics of such a statement? For instance, it 
does not identify the build parameters with which glibc 
has been built, nor is there any guarantee that the iden- 
tifier glibc-2.3.2 always refers to the same entity in all 
circumstances. Indeed, versions of Red Hat Linux and 
SuSE Linux both have RPM packages called glibc-2.3.2, 
but these are not the same, not even at the source level 
(they have vendor-specific patches applied). 


Source/binary deployment We must often create
both “source” and “binary” packages for a component.
Creating the latter manually is unfortunate,
since binary deployment can be considered an optimisa- 
tion of source deployment because it uses fewer 
resources on the target system. Ideally, the creation of 
binary packages would happen automatically and trans- 
parently, but in practice, the creation and dissemination 
of binary packages requires explicit effort. This is par- 
ticularly the case if multiple variants are required (which 
variants do we build, and how do users select them?). 


The source/binary dichotomy complicates depen- 
dency specification, since a component can have dif- 
ferent dependencies at build time and at run time that 
must be carefully identified. This is tricky, since a 
build time dependency can become a run time depen- 
dency if the construction process stores a reference to 
its dependencies in the build result — a retained depen- 
dency. For instance, various libraries such as 
OpenSSL are inputs to the Subversion build process. 
If they are shared libraries, then their full paths (e.g., 
/usr/lib/libssl.so.0.9.6) will be stored in the resulting 
Subversion executables, causing these build time 
dependencies to become run time dependencies. How- 
ever, if they are statically linked (which is a build time 
option of Subversion), then this does not occur. Thus, 
there is a subtle interaction between variant selection 
and dependencies. 


Centralised vs. local package management To 
make software deployment efficient, system adminis- 
trators should not have to install each and every appli- 
cation separately on every computer on a network. 
Rather, software installation should be managed cen- 
trally. On the other hand, computers or individual 
users may have individual software requirements. This 
requires local package management. Software
deployment should cater for both local and centralised
package management. It should not be hard to define
machine-local policies. 


Overview 


The Nix software deployment system is designed 
to overcome the problems of deployment described in 
the previous section. The main ingredients of the Nix¹
system are the Nix store for storing isolated installa- 
tions of components; user environments, providing a 
user view of a selection of components in the store; 
Nix expressions, specifying the construction of a com- 
ponent from its sources; and a generic means of shar- 
ing build results between machines. These ingredients 
provide mechanisms for implementing a wide variety 
of deployment policies. In this section we give a high- 
level overview of these ingredients from the perspec- 
tive of users of the system. In the next section their 
implementation is described. 


Nix Store 


The fundamental problem of current approaches 
to software deployment is the confusion of user space 
and installation space. An end-user interacts with the 
applications installed on a computer through a certain 
interface. This may be the start menu on Windows and 
other desktop environments, or the PATH environment 
variable in command-line interfaces on Unix-like sys- 
tems. These interfaces form what we call the user 
space. Deployment is concerned with making applica- 
tions available through such interfaces by installing all 
files necessary for their operation in the file system, 
i.e., in the installation space.


Mainly due to historical reasons — deployment 
was often done manually — user space and installation 
space are commonly identified. For instance, to keep 
the list of directories in the PATH manageable, applica- 
tions are installed in a few fixed locations such as 
/usr/bin. Thus, management of the end-user interface
to applications is equal to physical manipulation of 
installation space, entailing all the problems discussed 
in the previous section. 


In Nix, user space and installation space are sep- 
arated. User space is a view of installation space. 
Applications and all programs and libraries used to 
implement them are installed in the Nix store. Each 
component is installed in a separate directory in the 
store. Directory names in the store are chosen so as to 
uniquely identify revisions and variants of compo- 
nents. This identification scheme goes beyond simple 
name+version schemes, since these cannot cope with 
variants of the same version of a component. Thus, 
multiple versions of a component can coexist in the 
store without interference. 


¹ The name Nix is derived from the Dutch word niks,
meaning nothing; build actions do not see anything that has not
been explicitly declared as an input. 


Nix Expressions


Installation of components in the store is driven 
by Nix expressions. These are declarative specifica- 
tions that describe all aspects of the construction of a 
component, i.e., obtaining the sources of the compo- 
nent, building it from those sources, the components 
on which it depends, and the constraints imposed on 
those dependencies. Rather than having specific built- 
in language constructs for these notions, the language 
of Nix expressions is a simple functional language for 
computing with sets of attributes. Figure 1 shows a
Nix function that returns variants of the Subversion 
system, based on certain parameters; it features most 
typical constructs of the language. Figure 2 shows a 
call to this function. We will use these examples to 
explain the elements of the language. 


    { clientOnly, apacheModule, sslSupport
    , stdenv, fetchurl, openssl, httpd
    , db4 }:

    assert !clientOnly -> db4 != null;
    assert apacheModule -> !clientOnly;
    assert sslSupport -> (openssl != null
      && (apacheModule ->
            httpd.openssl == openssl));

    derivation {
      name = "subversion-0.32.1";
      system = stdenv.system;

      builder = ./builder.sh;
      src = fetchurl {
        url =
          http://.../subversion-0.32.1.tgz;
        md5 = "b06717a8ef50db4b...";
      };

      # Pass these to the builder.
      inherit clientOnly apacheModule
        sslSupport
        stdenv openssl httpd db4;
    }


Figure 1: Subversion component (subversion.nix).


    stdenv = import ...;
    openssl = import ...;
    ... # other component definitions

    subversion = (import subversion.nix) {
      clientOnly = false;
      apacheModule = false;
      sslSupport = true;
      inherit stdenv fetchurl openssl
        httpd db4 expat;
    };


Figure 2: Subversion composition (pkgs.nix). 


Derivation The body of the expression is formed 
by calling the primitive function derivation with an 
attribute set {key=value;...}. The set contains two 
attributes required by the derivation function: the builder 
attribute indicates a script that builds the component, 
while the system attribute specifies the target platform 
on which the build is to be performed. The other
attributes define values for use in the build process 
(such as dependencies) and are passed to the build script 
as environment variables. The name attribute is a sym- 
bolic identifier for use in the high-level user interface; it 
does not necessarily uniquely identify the component. 


Parameters In order to describe variants of a 
component, an expression can be parameterised, i.e., 
turned into a function from Nix expressions to Nix 
expressions. The syntax for functions is {k1, ..., kn}:
body, which defines a function that expects to be called
with an attribute set containing attributes with names
k1 to kn. Thus, the Subversion expression is
parameterised with expressions describing the components on
which it depends (e.g., openssl, httpd, stdenv), options 
that select features (e.g., clientOnly, sslSupport), and a 
utility (fetchurl). The stdenv component provides all the 
basic tools that one would expect in a Unix-like envi- 
ronment, e.g., a C compiler, linker, and standard Unix 
utilities. Parameters are instantiated in a function appli- 
cation. For example, the expression in Figure 2 instan- 
tiates the Subversion expression by assigning values to 
its parameters. 


A subtle but important difference with most 
component formalisms is that in Nix we explicitly 
describe not just components but also compositions of 
components. For instance, an RPM spec file specifies 
how to build a component, but not its dependencies. It 
merely states fairly weak conditions on the expected 
build environment (“a package called glibc-2.3.2
should be present”). Thus, a spec file is always
incomplete, so there is no way to uniquely specify 
concrete components. The Subversion Nix expression 
in Figure 1 is similarly incomplete, but the composition
in Figure 2 provides the whole picture: information
on how to build not just Subversion, but also all
of its dependencies. 


The value of the src attribute is another example 
of functional computation. Its value is the result of a 
call to the function fetchurl (passed in as an argument 
of the Subversion function) that downloads the source 
from a specific URL and verifies that it has the right 
MD5 checksum. 


Assertions In order to restrict the values that can 
be passed as parameters, a function can state asser- 
tions over the parameters. For example, the db4 data- 
base is needed only when a local server is imple- 
mented. Also, consistency between components can be 
enforced. For instance, if both SSL and Apache sup- 
port are enabled, then Apache must link against the 
same OpenSSL library as Subversion, since at runtime 
the Subversion code will be linked against Apache. If 
this were not enforced, link errors could result. 
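
For readers unfamiliar with the implication operator, the three assertions
of Figure 1 can be transcribed into Python as a rough sketch. The function
and parameter names are ours; httpd_openssl stands in for the httpd.openssl
attribute, and None plays the role of null.

```python
def implies(a: bool, b: bool) -> bool:
    """Logical implication: the counterpart of `a -> b` in Figure 1."""
    return (not a) or b


def check_subversion_args(client_only, apache_module, ssl_support,
                          openssl=None, httpd_openssl=None, db4=None):
    """Raise AssertionError if the arguments violate Figure 1's assertions."""
    # A local server (i.e., not client-only) needs the db4 database.
    assert implies(not client_only, db4 is not None)
    # The Apache module cannot be built in a client-only variant.
    assert implies(apache_module, not client_only)
    # With SSL, OpenSSL must be given, and Apache must use the same one.
    assert implies(ssl_support,
                   openssl is not None
                   and implies(apache_module, httpd_openssl == openssl))


# The composition of Figure 2: a full build with SSL, no Apache module.
check_subversion_args(False, False, True, openssl="openssl", db4="db4")
```

A call such as check_subversion_args(True, True, False) fails the second
assertion, since an Apache module is requested for a client-only build.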


Build When a derivation is built, the build script 
indicated by the builder attribute is invoked. As stated 
above, attributes of the derivation are passed through 
environment variables to the builder. In the case of 
attributes that refer to other derivations (i.e.,
dependencies), the corresponding environment variables
contain the paths at which they are stored. Nix ensures 
that such dependencies are built prior to the invocation 
of the builder, so the build script can assume that they 
are present. The special variable out conveys to the 
builder where it should store the build result. Figure 3 
shows the build script for Subversion. The largest part 
of the script is used to compute the configuration flags 
based on the features selected for the Subversion 
instance. By using a user-definable script for imple- 
menting the build of a component, rather than building 
in a specific build sequence, no requirements have to 
be made on the build interface of source distributions. 


    buildInputs="$openssl $db4 $httpd"
    # Bring in GCC etc., set up environment.
    . $stdenv/setup

    if ! test $clientOnly; then
      extraFlags="--with-berkeley-db=$db4 \
        $extraFlags"
    fi

    if test $sslSupport; then
      extraFlags="--with-ssl \
        --with-libs=$openssl $extraFlags"
    fi

    tar xvfz $src
    cd subversion-*
    ./configure --prefix=$out $extraFlags
    make
    make install


Figure 3: Subversion build script (builder.sh). 
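
The ordering guarantee described above (dependencies are built before the
builder runs, and their paths are handed to it through the environment) can
be sketched as a small recursive procedure. This is an illustration under
simplifying assumptions, not the Nix implementation: the derivations table,
the /example-store location, and the log standing in for actually running a
build script are all hypothetical.

```python
def build(name, derivations, store, log):
    """Build `name` after its dependencies and return its output path."""
    if name in store:                      # already built: reuse the result
        return store[name]
    drv = derivations[name]
    # Dependencies are built first; the builder sees their paths in env.
    env = {dep: build(dep, derivations, store, log) for dep in drv["deps"]}
    env["out"] = f"/example-store/{name}"  # where the result should be stored
    log.append((name, dict(env)))          # stand-in for invoking the builder
    store[name] = env["out"]
    return env["out"]


derivations = {
    "glibc": {"deps": []},
    "openssl": {"deps": ["glibc"]},
    "subversion": {"deps": ["openssl", "glibc"]},
}
store, log = {}, []
build("subversion", derivations, store, log)
order = [name for name, _ in log]
```

Each component is built once, and the simulated Subversion builder receives
the paths of openssl and glibc together with its own out location.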


User Environments 


A Nix user environment consists of a selection of 
applications from the store currently relevant to a user. 
“Users” can be human users, but also system users 
such as daemons and servers that need a specific 
selection to be visible. This selection may be imple- 
mented in various ways, depending on the interface 
used by the user. In the case of the PATH interface, a 
user environment is implemented as a single directory 
— the counterpart of /usr/bin — containing symbolic 
links (or wrapper scripts on systems that do not sup- 
port them) to the selected applications. Thus, manipu- 
lation of the user environment consists of manipula- 
tion of this collection of symbolic links, rather than 
directories in the store. Installation of an application in 
user space entails adding a symbolic link to a file in 
the store and uninstallation entails removing this sym- 
bolic link instead of physically removing the corre- 
sponding file from the file system. 
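
A user environment for the PATH interface can be imitated in a few lines.
The sketch below works in a temporary directory on a POSIX system; the
directory layout and the helper name make_user_environment are assumptions
made for the example.

```python
import os
import tempfile


def make_user_environment(env_dir, programs):
    """Create env_dir containing one symbolic link per selected program."""
    os.makedirs(env_dir)
    for link_name, target in programs.items():
        os.symlink(target, os.path.join(env_dir, link_name))


root = tempfile.mkdtemp()

# A stand-in for a program installed in the store.
store_svn = os.path.join(root, "store", "subversion-0.32.1", "bin", "svn")
os.makedirs(os.path.dirname(store_svn))
with open(store_svn, "w") as f:
    f.write("#!/bin/sh\n")

# "Install" it into the user environment by linking to it.
env_dir = os.path.join(root, "user-env", "bin")
make_user_environment(env_dir, {"svn": store_svn})
link_target = os.readlink(os.path.join(env_dir, "svn"))

# "Uninstall" it: only the link goes away, the store file is untouched.
os.remove(os.path.join(env_dir, "svn"))
still_in_store = os.path.exists(store_svn)
```

Putting env_dir on PATH would make svn visible to the user while the file
itself stays at its isolated location in the store.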


While other approaches (e.g., [4]) also use a 
directory with symbolic links, these are composed 
manually and/or are only provided in a single location. 
In Nix an environment is a component in the store. 
Thus, any number of environments can coexist and 
variant environments can be composed with tools. 
This separation of user space and installation space
allows the realization of many different deployment 
scenarios. The following are some typical examples: 

• A user environment may be prescribed by a
  system administrator, or may be adapted by
  individual users.
• Different users on the same system can
  compose different user environments, or can share a
  common environment.
• A single user can maintain multiple “profiles”
  for use in different working situations.
• A user can experiment with a new version of a
  component while keeping the old (stable)
  version around for regular tasks.
• Upgrading to a new version or rolling back to
  an old one is a matter of switching
  environments.
• Removal of unused applications can be
  achieved by automatic garbage collection,
  taking the applications in user environments as
  roots.


For instance, to add the Subversion component
in Figure 2 to the current user environment, we do:


$ nix-env -f pkgs.nix -i subversion


where pkgs.nix is the file containing the definition in Fig- 
ure 2. This will build Subversion and create a new user 
environment, based on the old one, to which Subversion 
has been added. If an expression for a new Subversion 
release comes along, we can upgrade as follows: 


$ nix-env -f pkgs.nix -u subversion


which likewise creates a new user environment, based 
on the old one, in which the old Subversion compo- 
nent has been replaced by the new one. However, the 
old user environment and the components included in 
it are retained, so it is possible to return to the old situ- 
ation if necessary: 


$ nix-env --rollback
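
Because old environments are kept, switching between them only requires
repointing a single reference. One common way to do this atomically on POSIX
systems is to create a new symbolic link and rename it over the old one; the
sketch below illustrates that general technique and is not claimed to be the
mechanism Nix itself uses. The names current, env-1, and env-2 are
hypothetical.

```python
import os
import tempfile


def switch_to(current_link, environment_dir):
    """Repoint current_link at environment_dir in a single rename step."""
    tmp = current_link + ".tmp"
    os.symlink(environment_dir, tmp)
    os.replace(tmp, current_link)  # atomically replaces the old link


root = tempfile.mkdtemp()
env1 = os.path.join(root, "env-1")   # old user environment
env2 = os.path.join(root, "env-2")   # new user environment
os.mkdir(env1)
os.mkdir(env2)
current = os.path.join(root, "current")

switch_to(current, env1)             # initial situation
switch_to(current, env2)             # upgrade
after_upgrade = os.readlink(current)
switch_to(current, env1)             # rollback: env-1 still exists
after_rollback = os.readlink(current)
```

At no point does current refer to a half-updated environment: it points
either at the old one or at the new one.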


There is no operation to physically remove compo- 
nents from the system. They can only be removed 
from a user environment, e.g., 


$ nix-env -e subversion


creates a new user environment from which the links 
to Subversion have been removed. However, storage 
space can be reclaimed by periodically running a 
garbage collector: 


$ nix-collect-garbage 


which removes any component not reachable from any 
user environment. (Therefore it is necessary to period- 
ically prune old user environments, e.g., once we find 
that we do not need to roll back to old ones). Garbage 
collection is safe because we know the full depen- 
dency graph between components. 
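
Given the dependency graph, finding what may be deleted is a plain
reachability computation from the user environments. The following is a
minimal sketch; the store contents and the references mapping are
hypothetical.

```python
def collect_garbage(store, references, roots):
    """Return the set of store entries not reachable from any root."""
    live, stack = set(), list(roots)
    while stack:
        item = stack.pop()
        if item in live:
            continue
        live.add(item)
        stack.extend(references.get(item, ()))
    return set(store) - live


# The user environment now refers to the new Subversion only.
references = {
    "user-env": {"subversion-0.34"},
    "subversion-0.34": {"openssl", "glibc"},
    "subversion-0.32.1": {"openssl", "glibc"},
    "openssl": {"glibc"},
    "glibc": set(),
}
garbage = collect_garbage(references.keys(), references, roots={"user-env"})
```

Only the old Subversion is reported as garbage; openssl and glibc stay
because the new version still depends on them. If an old user environment
that links to subversion-0.32.1 were still among the roots, nothing would be
collected, which is why old environments have to be pruned.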


Sharing Component Builds 


The unique identification of a component in the 
store is based on all the inputs to the build process, 
thus capturing all special configurations of the
particular variant being built. Thus, components can be
identified exactly and deterministically. Consequently a
component can be shared by all components that 
depend on it. Indeed we even get maximal sharing: if 
two components are the same, then they will occupy 
the same location in the store. This means that builds 
can be shared by users on the same machine. 


Since the identification only depends on the 
inputs to the build process and the location of the 
store, store identifiers are even globally unique. That 
is, a component build can be safely copied to a Nix 
store on another machine. For this purpose, Nix pro- 
vides support for transparently maintaining a collec- 
tion of pre-built components on some shared medium 
such as an FTP site or an installation CD-ROM. After 
building a component in the store it can be pushed to 
the shared medium. 


For instance, the installation and upgrade opera- 
tions above perform an installation from source. This 
is generally not desirable since it is slow. However, it 
is possible to safely and transparently re-use pre-built 
components from a shared resource such as a network 
repository. For instance, a component distributor or 
system administrator can pre-build components, then 
push (upload) them to a server using PUT requests: 


$ nix-push http://example.org/cache \
    pkgs.nix subversion


This will build Subversion (if necessary) and upload it 
and all its dependencies to the indicated site. A user 
can then make Nix aware of these: 


$ nix-pull http://example.org/cache 


Subsequent invocations of nix-env -i / -u will automati- 
cally use these if they are exactly equal to what the 
user is requesting to be installed. That is, if the user 
changes any sources, flags, and so on, the pre-built 
components will not be used, and Nix will revert to 
building the components itself. Thus, Nix is both a 
source and binary-based deployment system; deploy- 
ment of binaries is achieved transparently, as an opti- 
misation of a source-based deployment process. 
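
The decision between using a pre-built component and building from source
can be summarised in a short sketch. The function name obtain, the use of
sets for the local store and the shared cache, and the second store path are
illustrative assumptions; the first path is the shortened example name used
in this paper.

```python
def obtain(store_path, local_store, prebuilt, build):
    """Make store_path available locally and report how it was obtained."""
    if store_path in local_store:
        return "already present"
    if store_path in prebuilt:
        local_store.add(store_path)   # stand-in for downloading the build
        return "fetched pre-built"
    build(store_path)                 # fall back to building from source
    local_store.add(store_path)
    return "built from source"


prebuilt = {"/nix/store/eeeeaf42e56b-subversion-0.32.1"}
local_store, built = set(), []

first = obtain("/nix/store/eeeeaf42e56b-subversion-0.32.1",
               local_store, prebuilt, built.append)
# A variant with different inputs has a different (here: made-up) path,
# so the pre-built component does not match and a source build is needed.
second = obtain("/nix/store/000000000000-subversion-0.32.1",
                local_store, prebuilt, built.append)
```

The match is on the complete store path, so a pre-built component is used
only when it is exactly what the user asked for.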


Policies


Nix is policy-free. That is, the ingredients intro- 
duced above are mechanisms for implementing soft- 
ware deployment. A wide variety of policies can be 
based on these mechanisms. 


For instance, depending on the type of organisa- 
tion it may or it may not be desirable or possible that 
users install applications. In an organisation where 
homogeneity of workspaces is important, the selection 
and installation of applications can be restricted to sys- 
tem administration. This can be achieved by restricting 
all the operations on the store, and the composition of 
user environments to system administration. They may 
compose several prefab user environments for differ- 
ent classes of users. On the other hand, for instance in 
a research environment, where individual users have
very specific needs, it is desirable that users are capa- 
ble of installing and upgrading applications them- 
selves. In this situation environment operations and 
the underlying store operations can be made available 
to ordinary users as well. Similarly, Nix enables 
deployment at different levels of granularity, from a 
single machine, a cluster of machines in a local net- 
work, to a large number of machines on separate sites. 


Many other policies are possible; some are dis- 
cussed later. 


Implementation 


In this section we discuss the implementation of the 
Nix system. We provide an overview of the main com- 
ponents of the system, which we then discuss in detail. 


The Store 


The two main design goals of the Nix system are 
to support concurrent variants, and to ensure complete 
dependency information which is necessary to support 
completeness (the deployment process should transmit 
all dependencies necessary for the correct operation of 
a component). It turns out that the solutions to these 
goals are closely related. Other design goals are porta- 
bility (we should not fundamentally rely on operating 
system specific features or extensions) and storage 
efficiency (identical components should not be stored 
more than once). 


The first problem is dealing with variability, i.e.,
concurrent variants. As we hinted in the previous sec- 
tion, we support this by storing each variant of a com- 
ponent in a global store, where they have unique 
names and are isolated from each other. For instance, 
one version or variant of Subversion might be stored 
in /nix/store/eeeeaf42e56b-subversion-0.32.1,² while another
might end up in /nix/store/3c7c39a10ef3-subversion-0.34 . 
To ensure uniqueness, these names are computed by 
hashing all inputs involved in building the component. 


Thus, each object in the store has a unique name, 
so that variants can co-exist. These names are called 
store paths. In Autoconf [1] terminology, each component
has a unique prefix. The file system content referenced
by a store path is called a store object. Note that
a given store path uniquely determines the store object. 
This is because two store objects can only differ if the 
inputs to the derivations that built them differ, in which 
case the store path would also differ due to the hashing 
scheme used to compute it. Also, a store object can 
never be changed after it has been built. 
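
The naming scheme can be illustrated with a small sketch. It is not the algorithm used by Nix: the JSON serialisation of the inputs, the use of SHA-256 truncated to 128 bits, and the name store_path are assumptions made for illustration only. The point is that any change to a build input yields a different store path, while identical inputs always yield the same one.

```python
import hashlib
import json

STORE_ROOT = "/nix/store"

def store_path(name, inputs):
    """Derive a store path from a component name and all of its build inputs.

    `inputs` is a dict describing everything that influences the build.
    The serialisation and hash function are illustrative assumptions.
    """
    # Canonical serialisation: equal inputs always produce equal bytes.
    blob = json.dumps(inputs, sort_keys=True).encode("utf-8")
    # Keep 32 hexadecimal digits, i.e., 128 bits of the digest.
    digest = hashlib.sha256(blob).hexdigest()[:32]
    return f"{STORE_ROOT}/{digest}-{name}"

# Hypothetical build inputs for two variants of the same version.
base_inputs = {"src": "subversion-0.32.1.tar.gz", "system": "i686-linux"}
with_ssl = store_path("subversion-0.32.1", {**base_inputs, "sslSupport": True})
without_ssl = store_path("subversion-0.32.1", {**base_inputs, "sslSupport": False})

print(with_ssl)
print(without_ssl)
```

The two variants end up in different store paths, so they can co-exist.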


Figure 4 shows a number of derivates in the
store. The tree structure simply denotes the directory
hierarchy. The arrows denote dependencies, i.e., that
the file at the start of the arrow contains the path name
of the file at the end of the arrow, e.g., the program svn
depends on the library libc.so.6, because it lists the file
/nix/store/8d013ea878d0-glibc-2.3.2/lib/libc.so.6 as one of
the shared libraries against which it links at runtime.


² The actual names use 32 hexadecimal digits (from a
128-bit cryptographic hash), but they have been shortened
here to preserve space.


/nix/store
    eeeeaf42e56b-subversion-0.32.1
        bin
            svn
        lib
            libsvn_wc.so
            libsvn_ra_dav.so
    a17fb5a6c48f-openssl-0.9.7c
        lib
            libssl.so.0.9.7
    8d013ea878d0-glibc-2.3.2
        lib
            libc.so.6

Figure 4: The Store.


The use of these names also provides a solution 
for the dependency problem. First, it prevents unde- 
clared dependencies. While it is easy for hard-coded 
paths (such as /usr/bin/perl) to end up in component 
source, thereby causing a dependency that is easily 
forgotten while preparing for deployment, no devel- 
oper would manually write down these paths in the 
source (indeed, being the hash of all build inputs, they 
are much too “fragile” to be included). Second, we 
can now actually scan for dependencies. For instance, 
if the string 3c7c39... appears in a component, we
know that it has a dependency on a specific variant of 
Subversion 0.34. This in particular solves the problem 
of retained dependencies (discussed in the first sec- 
tion): it is not necessary to declare explicitly those 
build time dependencies that, through retention, 
become run time dependencies, since we can find 
them automatically. 
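
The scanning idea can be sketched as follows. This is a simplified illustration, not the Nix implementation: the function names hash_part and scan_references are ours, whole files are read into memory, and the gcc path in the example is hypothetical. Given the store paths that were inputs to a build, we report those whose hash part occurs anywhere in the output.

```python
import os
import tempfile

def hash_part(store_path):
    """Return the hash prefix of a store path, e.g. 'a17fb5a6c48f'."""
    return os.path.basename(store_path).split("-", 1)[0]

def scan_references(output_path, input_paths):
    """Return the input store paths whose hash occurs in a file under output_path."""
    wanted = {hash_part(p).encode("ascii"): p for p in input_paths}
    found = set()
    for root, _dirs, files in os.walk(output_path):
        for name in files:
            full = os.path.join(root, name)
            if os.path.islink(full):
                # A symlink refers to a store path through its target.
                data = os.fsencode(os.readlink(full))
            else:
                with open(full, "rb") as handle:
                    data = handle.read()
            for digest, path in wanted.items():
                if digest in data:
                    found.add(path)
    return found

inputs = [
    "/nix/store/a17fb5a6c48f-openssl-0.9.7c",
    "/nix/store/8d013ea878d0-glibc-2.3.2",
    "/nix/store/0123456789ab-gcc",  # hypothetical build-time-only input
]

# Fake build output: two files that embed store paths of their dependencies.
with tempfile.TemporaryDirectory() as out:
    os.makedirs(os.path.join(out, "lib"))
    os.makedirs(os.path.join(out, "bin"))
    with open(os.path.join(out, "lib", "libsvn_ra_dav.so"), "wb") as handle:
        handle.write(b"\x00/nix/store/a17fb5a6c48f-openssl-0.9.7c/lib\x00")
    with open(os.path.join(out, "bin", "svn"), "wb") as handle:
        handle.write(b"\x00/nix/store/8d013ea878d0-glibc-2.3.2/lib/libc.so.6\x00")
    retained = scan_references(out, inputs)

print(sorted(retained))
```

The compiler path is not found in the output, so in this sketch it remains a pure build time dependency.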


With precise dependency information, we can 
achieve the goal of complete deployment. The idea is 
to always deploy component closures: if we deploy a 
component, then we must also deploy its dependen- 
cies, their dependencies, and so on. That is, we must 
always deploy a set of components that is closed under 
the “depends on” relation. Since closures are self- 
contained, they are the units of complete software 
deployment. After all, if a set of components is not 
closed, it is not safe to deploy, since using them might 
cause other components to be referenced that are miss- 
ing on the target system. 
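
Computing a closure is a plain graph traversal. The sketch below assumes that the references of each store path are already known (for instance from scanning) and are given as a dictionary; the graph is loosely modelled on Figure 4, with hash prefixes omitted for brevity and the edge from OpenSSL to glibc assumed.

```python
def closure(roots, references):
    """Return all store paths reachable from `roots` via the 'depends on' relation."""
    closed = set()
    todo = list(roots)
    while todo:
        path = todo.pop()
        if path in closed:
            continue
        closed.add(path)
        # Follow the direct references of this path.
        todo.extend(references.get(path, ()))
    return closed

# Illustrative dependency graph (direct references only).
references = {
    "subversion-0.32.1": ["openssl-0.9.7c", "glibc-2.3.2"],
    "openssl-0.9.7c": ["glibc-2.3.2"],
    "glibc-2.3.2": [],
}

to_deploy = closure(["subversion-0.32.1"], references)
print(sorted(to_deploy))
```

Deploying exactly this set guarantees that no referenced component is missing on the target.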


Building Components 


So how do we build components from Nix 
expressions? This could be expressed directly in terms 
of Nix expressions, but there are several reasons why 
this is a bad idea. First, the language of Nix expressions 
is fairly high-level, and as the primary interface for
developers, subject to evolution; i.e., the language
changes to accommodate new features. However, this 
means that we would have to be able to deal with vari- 
ability in the Nix expression language itself: several 
versions of the language would need to be able to co- 
exist in the store. Second, the richness of the language 
is nice for users but complicates the sorts of operations 
that we want to perform (e.g., building and deploy- 
ment). Third, Nix expressions cannot easily be identi- 
fied uniquely. Since Nix expressions can import other 
expressions scattered all over the file system, it is not 
so straightforward to generate an identifier (such as a 
cryptographic hash) that uniquely identifies the 
expression. Finally, a monolithic architecture makes it 
hard to use different component specification for- 
malisms on top of the Nix system (e.g., we could 
retarget Makefiles to use Nix as a backend). 


For these reasons Nix expressions are translated 
into the much simpler language of store expressions, 
just as compilers generally do the bulk of their work 
on simpler intermediate representations of the code 
being compiled, rather than on a full-blown language 
with all its complexities. Store expressions describe 
how to build one or more store paths. Realisation of a 
store expression means making sure that all those
paths are present in the store. 


Derivation store expressions describe the build- 
ing of a single store component. They describe all 
inputs to the build process: other store expressions that 
must be realised first (build time dependencies), the 
build platform, the build script (which is one of the 
dependencies), and environment variable bindings. 
These are computed from calls to the derivation func- 
tion in the Nix expression language by recursively 
translating all input derivations to derivation store 
expressions, copying source files to the store, and 
adding all attributes as environment variable bindings. 


To perform the build action described by a 
derivation, the following steps are taken: 

1. Locks are acquired on the output path (the store
   path of the component being built) to ensure correctness
   in case of parallel invocations of Nix.

2. Input store expressions are realised. This
   ensures that all file system inputs are present.

3. The environment is cleared and initialised to
   the bindings specified in the derivation.

4. The builder is executed.

5. If the builder was executed successfully, we
   build a closure store expression that describes
   the resulting closure, i.e., the output path and
   all store paths directly or indirectly referenced
   by it. We do this by scanning every file in the
   output path for occurrences of the cryptographic
   hashes in the input store paths. For
   instance, when we build Subversion, the path
   /nix/store/a17fb5a...-openssl-0.9.7c is passed as an
   input. After the build, we find that the string
   a17fb5a... occurs in the file libsvn_ra_dav.so (as
   shown in Figure 4). Thus, we find that
   Subversion has a retained dependency on
   OpenSSL. Build time dependencies carried
   over to runtime are detected automatically in
   this way. (This approach is discussed in more
   detail in [9].)

6. The closure expression is written to the store.


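
Steps 3 and 4 can be illustrated with a short sketch; locking, realisation of inputs, and the scanning step are omitted. The function name run_builder, the variable NIX_SKETCH_LEAK, and the use of /bin/sh as a stand-in builder are assumptions for illustration, not part of Nix. The essential point is that the builder sees only the bindings from the derivation and nothing inherited from the invoking user.

```python
import os
import subprocess

def run_builder(builder_argv, bindings):
    """Run a builder with an environment that contains only `bindings`."""
    # Passing env= replaces the inherited environment instead of extending it.
    return subprocess.run(
        builder_argv,
        env=dict(bindings),
        capture_output=True,
        text=True,
        check=False,
    )

# A variable in the caller's environment that must not leak into the build.
os.environ["NIX_SKETCH_LEAK"] = "1"

bindings = {
    "out": "/nix/store/eeeeaf42e56b-subversion-0.32.1",
    "sslSupport": "1",
}

# Stand-in builder: report the output path and whether the leak is visible.
result = run_builder(
    ["/bin/sh", "-c", 'printf "%s %s" "$out" "${NIX_SKETCH_LEAK-unset}"'],
    bindings,
)
print(result.stdout)
```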
The command nix-instantiate translates a Nix
expression to a store expression:

    $ nix-instantiate pkgs.nix
    /nix/store/ce87...-subversion.store

The command nix-store --realise realises a derivation
store expression, returning the resulting closure store
expression:

    $ nix-store --realise \
        /nix/store/ce87...-subversion.store
    /nix/store/ab1f...77ef.store


Nix users do not generally have to deal with 
store expressions. For instance, the nix-env command 
hides them entirely — the user interacts only with high- 
level Nix expressions, which is really just a fancy 
wrapper around the two commands above. However, 
store expressions are important when implementing 
deployment policies. Their relevance is that they give 
us a way to uniquely identify a component both in 
source and binary form, through the derivation and 
closure store expression, respectively. This can be 
used to implement a variety of deployment policies. 


A crucial operation for deployment is to query 
the set of store paths referenced by a store expression. 
This is the set of paths that must be copied to another 
system to ensure that it can be realised there. For
instance, for the derivation above we get: 

    $ nix-store -qR \
        /nix/store/ce87...-subversion.store
    /nix/store/ce87...-subversion.store
    /nix/store/d1bc...0aa1-builder.sh
    /nix/store/f184...3ed7-gcc.store
    /nix/store/f199...0719-bash.store


That is, this set includes the derivation store expres- 
sions for building Subversion itself and its direct and 
indirect dependencies, a closure store expression for 
the builder, and so on. 


On the other hand, for the closure we get: 


    $ nix-store -qR \
        /nix/store/ab1f...77ef.store
    /nix/store/ab1f...77ef.store
    /nix/store/eeee...e56b-subversion-0.32.1
    /nix/store/a17f...c48f-openssl-0.9.7c
    /nix/store/8d01...78d0-glibc-2.3.2


This set only includes the closure store expression 
itself and the component store paths it references. 


Substitutes 


With just the mechanisms described above, Nix 
would be a source-based deployment system (like the
FreeBSD Ports collection [2], or Gentoo Linux [3]),
since all target systems would have to do a full build 
of all derivations involved in a component installation. 
This has the advantage of flexibility. Advanced users 
or system administrators can adapt Nix expressions to 
build a variant specifically tailored to their needs. For 
instance, required functionality disabled by default can 
be enabled, unnecessary functionality can be disabled, 
or the components can be built with specific optimisa- 
tion parameters for the target environment. The result- 
ing derivates may be smaller, faster, easier to support 
(e.g., due to reduced functionality), and so on. On the 
other hand, the obvious disadvantages are that source- 
based deployment requires substantial resources on 
the target system, and that it is unsuitable for the 
deployment of closed-source products. 


The Nix solution is to allow source-based deploy- 
ment to change transparently into binary-based deploy- 
ment through the mechanism of substitutes. For any 
store path, a substitute expression can be registered, 
which is also just a store derivation expression. Then, 
whenever Nix is asked to realise a closure that con- 
tains path p, and p does not yet exist, it will first try to 
build its substitute if available. The idea is that the 
substitute performs the same build as the original 
expression, but with fewer resources. Typically, this is 
done by fetching the pre-built contents of the output 
path of the derivation from the network, or from 
installation media such as a CD-ROM. This mecha- 
nism is generic (policy-free), because it does not force 
any specific deployment policy onto Nix. Specific 
policies are discussed later. 
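
The control flow of the substitute mechanism can be sketched as follows. The store is modelled as a dictionary and builds as callables; all names are hypothetical. That a failing substitute falls back to the original build is our reading of "first try" and should be taken as an assumption.

```python
def realise(path, store, substitutes, builders):
    """Ensure `path` is present in `store`, preferring a registered substitute."""
    if path in store:
        return "present"
    substitute = substitutes.get(path)
    if substitute is not None:
        try:
            store[path] = substitute()
            return "substituted"
        except OSError:
            pass  # e.g., server unreachable: fall back to the real build
    store[path] = builders[path]()
    return "built"

def fetch_prebuilt():
    return "contents fetched from the network"

def unreachable_server():
    raise OSError("server unreachable")

def build_from_source():
    return "contents built locally"

svn = "/nix/store/eeeeaf42e56b-subversion-0.32.1"
ssl = "/nix/store/a17fb5a6c48f-openssl-0.9.7c"

store = {}
builders = {svn: build_from_source, ssl: build_from_source}
substitutes = {svn: fetch_prebuilt, ssl: unreachable_server}

outcomes = [
    realise(svn, store, substitutes, builders),  # substitute succeeds
    realise(ssl, store, substitutes, builders),  # substitute fails, so build
    realise(svn, store, substitutes, builders),  # already in the store
]
print(outcomes)
```

Because the caller only asks for a path to be realised, it cannot tell whether the contents came from a source build or from a substitute.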


Deployment Policies 


A useful aspect of Nix is that while it is concep- 
tually a source-based deployment system, it can trans- 
parently support binary deployment through the sub- 
stitute mechanism. Thus, efficient deployment consists 
of two aspects: 

• Source level: Nix expressions are deployed to the
  target system, where they are translated to store
  expressions and built (e.g., through nix-env).

• Binary level: Pre-built derivates are made available,
  and substitute expressions are registered
  on the target system. This latter step is largely
  transparent to the users. There is no apparent
  difference between a “source” and a “binary”
  installation.


Source level deployment is unproblematic, since 
Nix expressions tend to be small. Typical deployment 
policies are to obtain sets of Nix expressions packaged 
into a single file for easier distribution, or to fetch them 
from a version management system. The latter is useful 
as it can easily allow automatic upgrades of a system. 
For instance, we can periodically (e.g., from a cron job) 
update the Nix expressions and build the derivations 
described by them. Note that any subexpressions that 
have not changed do not need to be rebuilt.


Binary level deployment presents more interesting
challenges, since even small Nix expressions can,
depending on the variability present in the expres- 
sions, yield an exponentially large set of possible store 
objects. Also, these store objects are large and may 
take a long time to build. Thus, we have to decide 
which variants are pre-built, who builds them, and 
where they are stored. 


Let us first look at the simplest deployment
policy: a fixed selection of variants is pre-built and
pushed onto an HTTP server, from where they can then
be pulled by clients. To push a derivation, all elements 
in the resulting closure are packaged (e.g., by placing 
them into a .tar.gz archive). All of this is entirely auto- 
matic: to push the derivations of some expression 
foo.nix the distributor merely has to issue the command 
nix-push foo.nix. 


The client issues the command nix-pull to obtain a
list of pre-built components available from a
pre-configured URL (i.e., the HTTP server). For each
derivation available on the server, substitute expres- 
sions are registered that (when built) will fetch, 
decompress, and unpack the packaged output path 
from the server. Note that nix-pull is lazy: it will not
fetch the packages themselves, just some information 
about them. 


subversion = {apacheModule, stdenv}:
  (import ./subversion.nix)
    { clientOnly = false
      sslSupport = true
      apacheModule = apacheModule
      stdenv = stdenv, ... };
subversion’ = {stdenv}:
  [(subversion {apacheModule = true})
   (subversion {apacheModule = false})];
subversion’’ =
  [(subversion’
     {stdenv = stdenv-Linux})
   (subversion’
     {stdenv = stdenv-FreeBSD})];

Figure 5: Variant selection.


The issue of which variants to pre-build requires 
the distributor to determine the set of variants that are 
most likely to be useful. For instance, for the Subver- 
sion component, it may never be useful to not have 
SSL support, but it may certainly be useful to leave out 
Apache server support, since that feature introduces a 
dependency on Apache, which might be undesirable 
(e.g., due to space concerns). Also, the platform for 
which to build must be selected. Figure 5 shows how
four variants of Subversion can be built. The function 
subversion supplies all arguments of the expression in 
Figure 1, except apacheModule and stdenv (which deter- 
mines the build tools, and thus the target platform). 
The function subversion’ uses this to produce two variants,
given a stdenv: one with Apache server support,
and one without. This function is in turn used by the
variable subversion’’, which calls it twice, with a stdenv
for Linux and FreeBSD respectively. Hence, this eval- 
uates to 2 x 2 = 4 variants. 
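
The same enumeration can be written as a Cartesian product over the features that are allowed to vary. The sketch below mirrors the selection described in the text (SSL support always on, client-only always off); representing a variant as a dictionary is an assumption made for illustration.

```python
from itertools import product

# Features that vary between pre-built variants.
apache_module_choices = [True, False]
stdenv_choices = ["stdenv-Linux", "stdenv-FreeBSD"]

variants = [
    {
        "clientOnly": False,   # fixed for all pre-built variants
        "sslSupport": True,    # fixed for all pre-built variants
        "apacheModule": apache_module,
        "stdenv": stdenv,
    }
    for stdenv, apache_module in product(stdenv_choices, apache_module_choices)
]

for variant in variants:
    print(variant)
```

Every additional independent choice multiplies the number of variants, which is why the distributor has to restrict the selection.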


Pre-building and pushing to a shared network 
site merely optimises deployment of common variant 
selections; it does not preclude the use of variants that 
are not pre-built. If a user selects a variant for which 
no substitute exists, the variant will be built locally 
from source. Also, input components such as compil- 
ers that are exclusively build time dependencies (that 
is, they appear in the derivation value but not in the 
closure value) will only be fetched or built when the 
variant must be built locally. 


The tools nix-pull and nix-push are not part of the 
Nix system as such; they are applications of the under- 
lying technology. Indeed, they are just short Perl 
scripts, and can easily be adapted to support different 
deployment policies. For instance, an entirely different 
policy is lazy construction, where clients push 
derivates onto a server if they are not already present 
there. This is useful if it is not known in advance 
which derivates will be needed. An example is mass 
installation of components in a heterogeneous net- 
work. In a peer-to-peer architecture each client makes 
its derivates available to all other clients (that is, it 
pushes onto itself, and pulls from all other clients). In 
this case there is no server, and thus, no need to pro- 
vide central storage scaling in the number of clients. 


User Environment Policies 


The use of cryptographic hashes in store paths 
gives us reliable identification of dependencies and 
non-interference between components, but we can 
hardly expect users to type, e.g., /nix/store/eeeeaf42e56b- 
subversion-0.32.1/bin/svn when they want to start a pro- 
gram! Clearly, we should hide these implementation 
details from users. 


We solve this problem by synthesising user envi- 
ronments. A user environment is the set of applica- 
tions or programs available to the user through normal 
interaction mechanisms, which in a Unix setting 
means that they appear in a directory in the user’s 
PATH environment variable. The user has in her PATH 
variable the path /nix/links/current/bin. /nix/links/current is 
a symbolic link (symlink) that points to the current 
user environment generation. Generations are sym- 
links to the actual user environment. They are needed 
to implement atomic upgrades and rollbacks: when a 
derivation is added or removed through nix-env, we 
build the new environment, and then create a genera- 
tion symlink to it with a number one higher than the 
previous generation. User environments are just sets 
of symlinks to programs of activated components 
(similar to, e.g., GNU Stow [4]), and are themselves 
computed using derivations. 


This is illustrated in Figure 6 (dotted lines denote
symlinks), where the current symlink points to
generation 42, which is in turn a symlink to a user
environment in the store. The user environment is simply
a tree of symlinks to activated components. Hence,
the path /nix/links/current/bin/svn indirectly refers to
/nix/store/eeee...-subversion-0.32.1/bin/svn.


Figure 6 also shows what happens when we 
upgrade Subversion, and add Mozilla in a single 
atomic action. A new environment is constructed in 
the store based on the current generation (42), the new 
generation (43) is made to point to it, and finally the 
current link is switched to point at generation 43. The 
semantics of the POSIX rename() system call ensures 
that this is an atomic operation. That is, users and pro- 
grams always see the old set of activated programs, or 
the new set, but never neither, both, or a mix. Since 
old generations are retained, we can atomically down- 
grade to them in the same manner. 
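
The atomic switch can be sketched as follows: the new symlink is first created under a temporary name and then renamed over the old one. This is a minimal illustration on a POSIX system, using a temporary directory in place of /nix/links; the function name switch_link and the temporary-name convention are assumptions.

```python
import os
import tempfile

def switch_link(link, target):
    """Atomically make the symlink `link` point to `target`."""
    tmp = f"{link}.tmp"
    os.symlink(target, tmp)
    # os.replace uses rename(), which replaces `link` in a single step,
    # so observers see either the old target or the new one.
    os.replace(tmp, link)

with tempfile.TemporaryDirectory() as links:
    current = os.path.join(links, "current")

    switch_link(current, "42")       # initial generation
    before = os.readlink(current)

    switch_link(current, "43")       # upgrade
    after = os.readlink(current)

    switch_link(current, "42")       # rollback uses the same operation
    rolled_back = os.readlink(current)

print(before, after, rolled_back)
```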


/nix/links

/nix/store
    84c85f89ddbf-user-env
        bin
            mozilla
            svn
    eeeeaf42e56b-subversion-0.32.1
        bin
    58823d558a6a-subversion-0.34
        bin
            svn
    27140513a0f9-mozilla-1.4
        mozilla

Figure 6: User environments.


The generation links are the only external links 
into the store. This means that the only reachable store 
paths are those in the closure of the targets of the gen- 
eration links. The closure can be found using the clo- 
sure values computed earlier. Since all store paths not 
in this closure are unreachable, they can be deleted at 
will. This allows Nix to do automatic garbage collec- 
tion of installed components. Nix has no explicit oper- 
ation to delete a store path — that would be unsafe, 
since it breaks the integrity of closures containing that 
path. Rather, it provides operations to remove deriva- 
tions from the user environment, and to garbage col- 
lect unreachable store paths. Store paths reachable 
only from old generations can be garbage collected by 
removing the generation links. 
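
Garbage collection then amounts to computing what is reachable from the generation links and deleting the rest. The sketch below uses a hypothetical store loosely modelled on Figure 6, with hash prefixes omitted; the function name garbage is ours.

```python
def garbage(store_paths, references, roots):
    """Return the store paths that are not reachable from any root."""
    live = set()
    todo = list(roots)
    while todo:
        path = todo.pop()
        if path not in live:
            live.add(path)
            todo.extend(references.get(path, ()))
    return set(store_paths) - live

store_paths = [
    "user-env-42", "user-env-43",
    "subversion-0.32.1", "subversion-0.34", "mozilla-1.4",
]
references = {
    "user-env-42": ["subversion-0.32.1"],
    "user-env-43": ["subversion-0.34", "mozilla-1.4"],
}

# Both generations are still linked: nothing may be deleted.
with_old_generation = garbage(store_paths, references, ["user-env-42", "user-env-43"])

# After removing the link for generation 42, its exclusive paths are garbage.
without_old_generation = garbage(store_paths, references, ["user-env-43"])

print(sorted(with_old_generation))
print(sorted(without_old_generation))
```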


This scheme, where a user environment is cre- 
ated for the entire system, is just the simplest user 
environment policy. The creation of a user environment
is itself a normal derivation, and the command
nix-env used in the second section is a simple wrapper
that automatically creates a derivation, builds it, and 
switches the current generation to the resulting output 
path. The build script used by nix-env for environment 
creation is a fairly trivial Perl script that creates sym- 
links to the files in its input closures. A simple modifica- 
tion is to allow profiles — environments for specific users 
or situations. This can be done by specifying a different 
link directory (e.g., /home/joe/.nixlinks). Also, multiple 
versions of the same program in an environment can be 
accommodated through renaming (e.g., a symlink 
svn-0.34), which is a policy decision that can be imple- 
mented by modifying the user environment build script. 


derivation {
  name = "site-env";
  builder = ./create-symlinks.pl;
  inputs = [
    ((import ./subversion.nix) { ... })
    ((import ./mozilla.nix) { ... })
  ];
}

Figure 7: A Nix expression to build a site-wide user
environment (site-wide.nix).


A more interesting extension is stacked user envi- 
ronments, where one environment links to the pro- 
grams in another environment. This is easily accom- 
modated: just as the inputs to the construction of an 
environment can be concrete components (such as 
Subversion), they can be other environments. The 
result is another indirection in the chain of symlinks. A 
typical scenario is a 2-level scheme consisting of a site- 
wide environment specified by the site system admin- 
istrators, with user-specific environments that augment 
or override the site-wide environment. Concretely, the 
site administrator makes a Nix expression as in Figure 
7 (slightly simplified) and makes it available on the 
local network. Locally, a user can then link this site- 
wide environment into her own environment by doing 

    nix-env -f site-wide.nix -i site-env

where site-wide.nix refers to the Nix expression. This 
will replace any previously installed derivation with 
the symbolic name site-env. To ensure that changes to 
the site-wide environment are automatically propa- 
gated, these commands can be run periodically (e.g., 
from a cron job), or initiated centrally (by having the 
administrator remotely execute them on every
machine and/or for every user). 


Should components in the local environment 
override those in the site-wide environment? Again, 
this is a policy decision, and either possibility is just a 
matter of adapting the builder for the local user envi- 
ronment, for instance to give precedence to deriva- 
tions called site-env. 


Server configurations. User environments (contrary
to what the term implies) can be used not only to
specify environments for specific users, but also for
specific tasks or processes. In particular, they can be
used to specify complete server configurations, which
include not only the software components constituting
some server, but also its configuration and other
auxiliary files. Consider, for instance, an Apache/Sub- 
version server (the Subversion server runs as a module 
on top of Apache). It consists of several components 
that are rather picky about specific dependencies, e.g., 
Apache, Subversion, ViewCVS, Python, and Perl, but 
also our repository management CGI scripts, static 
HTML documents and images, the Apache httpd.conf 
configuration file, SSL private keys, and so on. Since 
these are also components (just not necessarily exe- 
cutable components) they can be managed using Nix. 


Figure 8 shows a (simplified) Nix expression for 
an Apache/Subversion server. It takes a single argu- 
ment that specifies whether a test or production server 
is to be built. The builder produces a component con- 
sisting of an Apache configuration file, and a control 
script to start and stop the server. The builder gener- 
ates these by substituting values such as the desired 
port number and the paths to the Apache and Subver- 
sion components into the given source files. 


{productionServer}:

derivation {
  builder = ./builder.sh;
  configuration = ./httpd.conf.in;
  controller = ./ctl.sh.in;
  portNumber = if productionServer
    then 80 else 8080;
  inherit (import ...) httpd subversion;
}


Figure 8: A Nix expression to build a Subversion 
server. 
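
The substitution performed by such a builder can be sketched as follows. The actual builder in Figure 8 is a shell script that is not shown; the placeholder syntax ($name), the template contents, the module path, and the store paths below are all assumptions made for illustration. Only the choice of port (80 for production, 8080 otherwise) is taken from Figure 8.

```python
from string import Template

def instantiate(template_text, production_server, httpd, subversion):
    """Fill in a configuration template for a test or production server."""
    port = 80 if production_server else 8080
    return Template(template_text).substitute(
        portNumber=port, httpd=httpd, subversion=subversion
    )

# Hypothetical contents of httpd.conf.in.
HTTPD_CONF_IN = (
    "ServerRoot $httpd\n"
    "Listen $portNumber\n"
    "LoadModule dav_svn_module $subversion/modules/mod_dav_svn.so\n"
)

# Hypothetical store paths; real ones would contain a hash.
httpd_path = "/nix/store/HASH-httpd"
subversion_path = "/nix/store/HASH-subversion"

test_conf = instantiate(HTTPD_CONF_IN, False, httpd_path, subversion_path)
production_conf = instantiate(HTTPD_CONF_IN, True, httpd_path, subversion_path)

print(test_conf)
```

Since the generated file names its dependencies by store path, scanning it yields the same dependency information as for any other component.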


Now, given a simple script upgrade-server (not 
shown here) that uses nix-env -U to build the new server 
configuration, stop the server running in the old genera- 
tion, and start the new one, we can easily instantiate 
new server configurations by editing source files such 
as httpd.conf.in, and calling upgrade-server. For instance, 
the command upgrade-server test instantiates the Nix 
expression by calling it with a false argument, thus pro- 
ducing a test server. If this is found to work properly, 
we can issue upgrade-server production to upgrade the 
production server. nix-env --rollback can be used to go
back to the previous generation, if necessary. 


The server is started using the script controller.sh 
which is part of the server configuration component. It 
initialises PATH to point to a specific set of compo- 
nents. This means that the server configuration is self- 
contained: it does not depend on anything not explic- 
itly specified in the Nix expression. Such a configura- 
tion is therefore pretty much immune to external con- 
figuration changes, and can be relatively easily trans- 
ferred to another machine. 


The only thing not under Nix control here is state 
— things that are modified by the server, e.g., the actual 
Subversion repositories and user account databases. 


2004 LISA XVIII — November 14-19, 2004 — Atlanta, GA 


Dolstra, de Jonge, and Visser 


Thus, Nix can be used for the deployment of not 
just software components, but also complete system 
configurations — the domain of tools such as Cfengine 
[7]. Note that Cfengine declaratively specifies destruc- 
tive changes to be performed to realise a desired con- 
figuration. This makes it hard to easily run several 
configurations in parallel on the same machine, or to 
switch back and forth between configurations. Also, 
Cfengine is typically not used to manage the software 
components on a machine (although this is possible, 
e.g., by installing the appropriate packages in a
Cfengine action [20]).


Experience 


We have applied Nix to a number of problem 
domains. 


Software deployment. We have “nixified” 180
or so existing Unix packages, including large ones 
such as Mozilla Firefox with all its dependencies 
(which includes the C compiler, basic Unix tools, X11, 
etc.). They are prebuilt for Linux and made available 
through the push/pull mechanism. 


The fundamental limitation to Nix’s dependency 
checking is that it will not prevent undeclared depen- 
dencies on components outside of the store. For 
instance, if a builder calls /bin/sh, we have no way to 
detect this. To minimise the probability of such unde- 
clared dependencies, we use patched versions of gcc, 
ld, and glibc that refuse to use header files and libraries 
outside of the Nix store. In our experience this works 
quite well. For instance, the prebuilt Nix packages 
work on a variety of Linux distributions — evidence 
that no (major) external components are used. A com- 
mon problem with these distributions is that they often 
differ in subtle ways that cause packages built on one 
system to fail on another, e.g., because of C library 
incompatibilities. However, our Nix components are 
completely boot-strapped, that is, they are built using 
only build tools, libraries, etc., that have themselves 
been built using Nix, and do not rely on components 
outside of the Nix store (other than the running ker- 
nel). Using our reliable dependency analysis, any 
required libraries and other components are deployed 
also. Thus, they just “work.”


The ability to very rapidly perform rollbacks is 
often a life-saver. For instance, it happens quite fre- 
quently that we attempt to upgrade some bleeding- 
edge software package, only to discover that it doesn’t 
work quite as well as the previous version (or not at 
all!). A simple nix-env --rollback saves the day. In most 
package managers, recovery would be much harder, 
since we would have to know exactly what the previ- 
ous configuration was, and we would have to have a 
way to re-obtain the old versions of the packages that 
were just upgraded. 


Service deployment. As described earlier, Nix can
be used for the deployment of not just software components,
but also complete configurations of system
services. For instance, our department’s Subversion
server is managed in this way. The main advantages 
are that it is very easy to run multiple instances of a 
service (e.g., for testing — and the test server will in no 
way interfere with the production server!), that it is 
easy to move a service to another machine since we 
have full dependency information, and again that we 
can roll back to earlier versions.


Build farms. It is a good software engineering
practice to build software systems continuously during 
the development process [13]. In addition, if software 
is to be portable, it should be built on a variety of 
machines and configurations. This requires a build 
farm — a set of machines that sit in a loop building the 
latest version obtained from the version management 
system. Build farms are also important for release 
management — the production of software releases — 
which must be an automatic process to ensure repro- 
ducibility of releases, which is in turn important for 
software maintenance and support. 


The management of a build farm is often highly 
time-consuming. For instance, if the component being 
built in the build farm requires (say) Automake 1.7, 
we must install that version of Automake on each 
machine in the build farm. If at some point we need a 
newer version of Automake, we again must go to each 
machine to perform the upgrade. So maintaining a 
build farm scales badly. Worse, there may be conflict- 
ing dependencies (e.g., some other component in the 
build farm may only work with Automake 1.6). 


Such management of dependencies is exactly 
what Nix is good at, so we have implemented a build 
farm on top of Nix. The main advantages over other 
build farms (e.g., [12]) are: 

• The Nix expression language makes it easy to
  describe the build tasks, along with their dependencies.

• Nix ensures that the dependencies are installed
  on each machine in the build farm.

• The hashing scheme ensures that identical
  builds (e.g., of dependencies) are performed
  only once.

• In Nix, each derivation has a system attribute
  that specifies on what kind of platform the
  derivation is to be performed (e.g., i686-linux).
  If the attribute does not match the type of the
  platform on which Nix is run, Nix can automatically
  distribute the derivation to a different
  machine of the intended platform type, if one
  exists. All inputs to the derivation are copied to
  the store of the remote machine, Nix is run on
  the remote machine, and the result is copied
  back to the local store. Thus, dealing with
  multi-platform builds is fairly transparent: we
  can write a Nix expression specifying derivations
  on a variety of platforms and run it on an
  arbitrary machine. There is no need to schedule
  the build separately on each machine.

• The resulting builds can be used immediately
  by other developers since they are made available
  through nix-push.
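
The platform-based distribution mentioned in the list above boils down to a lookup on the system attribute. The following sketch shows only the selection of a machine; copying inputs and results between stores is omitted. The function name, the host names, and the powerpc-darwin platform string are hypothetical.

```python
def choose_machine(derivation_system, local_system, remote_machines):
    """Pick where to perform a derivation, based on its system attribute."""
    if derivation_system == local_system:
        return "localhost"
    for host, system in remote_machines.items():
        if system == derivation_system:
            return host
    return None  # no machine of the intended platform type exists

# Hypothetical build farm: host name -> platform type.
remote_machines = {
    "builder-a.example.org": "i686-linux",
    "builder-b.example.org": "powerpc-darwin",
}

local = choose_machine("i686-linux", "i686-linux", remote_machines)
remote = choose_machine("powerpc-darwin", "i686-linux", remote_machines)
missing = choose_machine("sparc-solaris", "i686-linux", remote_machines)

print(local, remote, missing)
```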


A downside to a Nix-based build farm is that 
installing a package through Nix differs from the 
“native” way of installing a package on existing plat- 
forms (e.g., by installing an RPM on a Red Hat 
machine). Thus it is difficult for a Nix build farm to 
verify whether a package works when built from 
source in the native way. However, on Linux systems, 
we can in fact build native packages (such as RPMs) 
without affecting the host system by using User-Mode 
Linux [5] in Nix derivations. In fact, this fits in quite 
well. For instance, the synthesis of the UML disk 
images for the various platforms for which we build 
packages is just a normal Nix derivation that creates 
an Ext2 file system from an arbitrary set of RPMs 
constituting a Linux distribution. 


Related Work 


Centralised and local package management.
Package management should be centralised but each 
machine must be adaptable to specific needs [26]. 
Local package management is often ignored in favour 
of centralised package management [19, 17]. In our 
approach, central configurations can easily be shared 
and local additions can be made. Any user can be 
allowed to deviate from a central configuration. Soft- 
ware installation by arbitrary users is discussed in 
[21]. In [25], policies are introduced that define which
installation tasks are permitted. This might be a challenging
extension to Nix. Modules [14] makes software
deployment more transparent by abstracting
from the details of software deployment. Application-specific
deployment details are captured in “modulefiles,”
which can be shared across large-scale distributed
networks, similar to Nix expressions. Modules
lacks the safety properties of Nix. As a result, correct
operation of typical deployment tasks, as discussed
initially, cannot be guaranteed.


Non-interference. Software packages should not
interfere with each other. Typical interference is 
caused by attempting to have multiple versions of a 
component installed. It is important that multiple ver- 
sions can coexist [21], but this is difficult to achieve 
with current technology [6]. A common approach is to 
install software packages in separate directories, 
sometimes called collections [26]. In [18], a directory 
naming scheme is used that restricts the number of 
concurrent versions and variants of a package. Sharing 
is in most deployment systems either unsafe due to 
implicit references, or not supported at all because 
every application is made completely self-contained 
[17, 19]. Sharing of data across platforms using a 
directory structure that separates platform specific 
from platform independent data is discussed in [17], 
which is concerned with diversity in platform, not 
diversity in feature sets. As a consequence, however,
exchange and sharing of packages is not truly safe,
whereas it is in Nix.


Safe upgrading. Many systems ignore this issue
[26]. Automatic rollback on failures is discussed in 
[19]. This turned out to be undesirable in practice 
because it increased installation time and did not 
increase consistency. RPM [11] has a notion of transactions:
if the installation of a set of packages fails,
the entire installation is undone. This is not atomic, so 
the packages being upgraded are in an inconsistent 
state during the upgrade. The approach discussed in 
[18] uses shortcuts to default package versions, e.g., 
emacs pointing to emacs-20.2. This is unsafe because 
programs may now use emacs which initially corre- 
sponds to emacs-20.2, but after an upgrade points to, 
e.g., emacs-20.3. Separation of production and devel- 
opment software via directories is discussed in [17]. 
Once an application has been fully tested under the 
development tree it is turned into production. This 
requires recompilation because path names will 
change and may cause errors. Consequently, the 
approach is not really safe. 


Garbage collection. In [21] an approach for
removing old software is discussed. Basically, after 
software is “removed” by making the directory 
unreadable, one verifies whether other software fails 
by running it. If so, the deletion is rolled back by mak- 
ing the directory readable again. This is unsafe 
because the test executions may not reveal every 
dependency, and because a time window is introduced
during which some components do not work. 


Dependency analysis. However, the same paper [21]
also describes a pointer scanning mechanism similar 
to ours: component directories are scanned for the 
names of other component directories (e.g., tk-3.3). 
However, such names are not very unique (contrary to 
cryptographic hashes) and may lead to many false 
positives. Also, component dependencies are scanned 
for after the component has been made unreadable, 
not before. In [23] a dependency analysis tool for 
dynamic libraries is discussed. In Nix this information 
is already available when an application is installed, 
and Nix is not restricted to detecting dependencies on 
shared libraries only. Vesta [16] is a system for config- 
uration management that supports automatic depen- 
dency detection. Like Nix, it detects only dependen- 
cies that are actually needed, and dependencies are 
complete, i.e., every aspect of the computing environ- 
ment is described and controlled by Vesta. 
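
The scanning idea can be sketched in a few lines. This is an illustration rather than Nix's code: the function name and its arguments are assumptions. Given the hashes of known components, it reports those that occur anywhere in the files below a component directory. Because cryptographic hashes are long and effectively random, an accidental match is far less likely than with short names such as tk-3.3.

```python
import os


def scan_for_references(component_dir, candidate_hashes):
    """Return the subset of candidate_hashes that occur in any file
    below component_dir. File contents are searched as raw bytes."""
    wanted = {h.encode("ascii"): h for h in candidate_hashes}
    found = set()
    for root, _dirs, files in os.walk(component_dir):
        for name in files:
            path = os.path.join(root, name)
            if os.path.islink(path):
                # A symlink target can also mention another component.
                data = os.readlink(path).encode()
            else:
                with open(path, "rb") as f:
                    data = f.read()
            for needle, h in wanted.items():
                if needle in data:
                    found.add(h)
    return found
```

A production scanner would read large files in chunks; the sketch reads each file whole to keep the logic visible.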


Safety. In [25] common wrong assumptions of
package managers are explained, including: i) package
installation steps always operate correctly; ii) all software
system configuration updates are the result of
package installation. In Nix, software gets installed 
safely, without affecting the environment. Thus, in contrast
to many other systems, Nix will never bring a system
into an unstable state. Unless a system administrator
really wants to mess things up, all upgrades to the Nix
store are the result of package installation. Safe testing 
of applications outside production environments is dis- 
cussed in [21, 17]. In [15] it is confirmed that software 
should be installed in private locations to prevent inter- 
ference between software packages. Interference turns 
out to be a very common cause of installation problems. 
In Nix, such packages can safely coexist. 

Packaging. In [22] a generic packaging tool for
building bundles (i.e., collections of products that may 
be installed as a unit) is discussed. Source tree compo- 
sition [8] is an alternative technique for automatically 
producing bundles from source code components. 
However, these bundling approaches do not cater for 
sharing of components across bundles. 


Conclusion 


Seemingly simple tasks such as installing or 
upgrading an application often turn out to be much 
harder than they should be. Unexpected failures and 
the inability to perform certain actions affect users of 
all levels of computing expertise. In this paper we 
have pinpointed a number of causes of the deployment 
malady, and described the Nix system that addresses 
these by using cryptographic hashes to enforce 
uniqueness and isolation between components. It is 
successfully used to deploy software components to 
several different operating systems, to manage server 
configurations, and to support a build farm. 


There are a number of interesting issues remain- 
ing. Of particular interest is our expectation that Nix 
will permit sharing of derivations between users. That 
is, if user A has built some derivation, and user B 
attempts to build the same derivation, B can transpar- 
ently reuse A’s result. Clearly, using code built by oth- 
ers is not safe in general, since A may have tampered 
with the result. However, our use of cryptographic 
hashes can make this safe, since the hash includes all 
build inputs, and therefore completely characterises 
the result. 


The problems of dependency identification and 
dealing with variants also plague build managers such 
as Make [10]. We believe that (with some extensions) 
Nix can be used to replace these more low-level soft- 
ware configuration management tools as well. 


Availability 


Nix is free software and is available online at 
http://www.cs.uu.nl/groups/ST/Trace/Nix .


ABSTRACT


A tool is described that provides for the automatic configuration of systems from a single 
description. The tool, newfig, uses two simple concepts to provide its functionality: boolean logic 
for making decisions and file construction for generating the files. Newfig relies heavily on 
external scripts for anything beyond the construction of files. This simple yet powerful design 
provides a mechanism that can easily build on other tools rather than a single monolithic stand- 
alone program. This provides for a great deal of flexibility while maintaining simplicity. The 
language is a combination of boolean logic and output statements, and also provides for macros 
and other essential elements. All output is written to channels: an abstraction which provides for 
extensive configurability. Examples are provided that show the tool’s power and flexibility. A 
description is also provided of the efforts undertaken at CNN to integrate this tool into a sizable
infrastructure. The paper concludes with a discussion of future improvements. 


Introduction 


As the number of systems in an infrastructure 
grows, the administration problem grows with it. 
Unless the engineering staff grows accordingly, at some 
point the management of the systems will need to be 
automated. This realization is nothing new, and several 
successful tools have been developed to meet this need. 


Configuration data used by a system to control 
how it behaves can be broken down into three basic 
categories: information which is unique to a system, 
information which is common to a proper subset of 
systems, and information which is common to all sys- 
tems in the infrastructure. Unique information includes 
a system’s hostname, local disk partitions, and network 
address assignments. Global information includes such 
files as /etc/services and /etc/networks. The partially 
global information is the most interesting: it is com- 
mon across systems which share a similar function but 
not necessarily others. The task of configuring a sys- 
tem requires that all of this information be in place and 
correct. The most complicating factor in achieving a 
correct configuration is in the management of files 
which combine data from more than one category. 


A file that only contains unique data can be 
copied into place when a system is installed. Such files 
are rarely touched after system installation. Files that 
contain only global information can be easily updated 
from a central repository. Likewise, files that contain 
partially global information can be updated from a cen- 
tral authority provided there is some mechanism that 
distinguishes among the different classes of systems 
and is able to determine which data are appropriate for 
the system. However, when a file contains a mixture of 
these data categories, its maintenance becomes signifi- 
cantly more difficult. This is most noticeable in files 
such as fstab, inetd.conf, hosts.allow, rc script directories,
and sometimes passwd.


When we sought out a tool for automated systems
configuration, we were looking for certain properties
that we believe would be beneficial in our environment.
We wanted the system to be idempotent,
congruent, deterministic, transportable, extensible, and 
fail-safe. We wanted a specification language that was 
both clear and concise, to minimize training and maxi- 
mize understanding. When a tool could not be found 
that met all of these criteria, we set out to design a 
new solution to the problem. The result, newfig, takes 
a new approach to the problem while providing all of 
the qualities that we believe are important. As a result, 
very little is built in to newfig. Instead it is a frame- 
work which can utilize other tools and scripts to 
accomplish results. Rather than provide a wide range 
of built-in mechanisms, newfig uses file construction 
as its only primitive operation. Its configuration is a 
purely declarative language based on boolean logic. 
The files that newfig constructs, called channels, can 
be used to replace existing files on the system or as 
input to external programs (including scripts). As a 
result, the system is naturally extensible. Newfig is 
used to generate input for and monitor the execution 
of the programs and scripts which perform the actual 
modifications to the system. It provides a structure 
around which system administrators can do what they 
do best: automate through scripting. We believe that 
the resulting system meets all our initial design goals 
and provides us with an excellent platform for auto- 
mated configuration. 


Related Work 


Prescriptions [8] is a declarative language for 
describing the desired state of configuration for dis- 
tributed systems. It provides mechanisms for specify- 
ing operations that may be used to bring systems into 
conformance with a specification. However, it does
not seem to provide mechanisms for file distribution
and synchronization. We have not been able to find 
recent work on this project since Thornton’s 1994 
technical report. 


CFengine [1] is the most widely known work in 
this area. It is to automated systems configuration 
what awk(1) is to scripting. The main concepts are 
patterns, actions, and an execution context that is man- 
aged by an interpreter. It provides a rich body of built- 
in actions, but has little room for expansion beyond 
that set. The configuration drives actions to perform 
on a system, whereas newfig provides a description of 
the desired target for the system and enforces con- 
formity to that description. CFengine implements con- 
vergence by providing a mechanism that brings a sys- 
tem closer to an ideal. In our environment we sought a 
tool that implements congruence, which is a tighter 
standard than convergence. The idea of file construc- 
tion is not central to CFengine, and a CFengine con- 
figuration that uses such a methodology tends to be 
cumbersome. CFengine is also not able to remove 
changes implicitly: such steps must be explicitly given 
in the configuration. 


Psgconf [6] takes a highly modular approach to 
configuration management, providing hooks for vari- 
ous data stores, policy rules, and actions. The configu- 
ration parsing is order dependent, making psgconf a 
procedural configuration tool rather than declarative, 
which has some advantages and some drawbacks. 


Site [3] uses declarative statements to describe 
the configuration of a computing site at three levels of 
abstraction. At the lowest levels, drivers written in C 
are used to construct the contents of configuration 
files. The paper describes a prototype implementation 
only. In contrast, newfig provides no restrictions on the 
language used to construct channels, which is good for 
admins who have long since shed their systems pro- 
gramming scales (or never had them to begin with). 


PIKT [5] is an interpreted scripting language, 
preprocessor, and scheduler that is primarily intended 
to monitor systems, reporting problems and taking 
corrective action when possible. Over time, it has been 
extended to include configuration management fea- 
tures, but most of the terminology in the language is 
built around monitoring. For example, scheduling 
periodic execution of a script involves adding it to the 
“alarms” section of the alerts.cfg file. 


Radmind [2] takes a file based approach to con- 
figuration management by integrating intrusion detec- 
tion with centralized system management. An advan- 
tage is that complex, out-of-band changes can be cap- 
tured and incorporated into the configuration, but 
maintaining consistency of configuration data that is 
duplicated across multiple files may present a chal- 
lenge without factoring tools. 


ISconf [9] is a highly order dependent configuration
tool based on make(1) files. A description of
changes is provided to the tool, and it ensures that
those changes are carried out on each system in an 
exact order. ISconf version 2 requires that the description
be created and maintained manually, whereas later
versions provide more automated ways for generating 
the description. The basic premise of ISconf is that 
“order matters”: it is easier to replicate the order in 
which operations are conducted than it is to determine 
and accommodate the interactions of those operations. 
Like newfig, ISconf implements congruence. The dif- 
ficulty with ISconf is the monotonic increase to the 
description and the steps required to recreate a system. 
As changes are piled on top of changes, the time 
required to build or rebuild a system continues to 
increase. Although preservation of the order of 
changes is sufficient to achieve congruence, it is our 
belief that it is not necessary. 


Design 


Newfig is a system designed to provide for the 
automatic configuration of individual machines from a 
common description. Boolean algebra is used to con- 
trol the generation of output to a number of channels. 
Each channel can be used to control the contents and 
characteristics of a file. Channels can also be used as 
input to scripts for operations which are more compli- 
cated than basic file construction. All channel defini- 
tions are part of the configuration, allowing the func- 
tionality of newfig to be extended with ease. Newfig is 
designed to be idempotent, transportable, extensible, 
conformant (rather than convergent), and fail-safe. 


The configuration consists of a series of boolean 
phrases interspersed with output statements. Boolean 
algebra is used as the logical structure for the newfig 
configuration language. Clauses are used to infer the 
logical value of a symbol from other symbols. The 
algebra supports the three basic logical operations: 
and, or, not. Parentheses are also recognized for 
grouping operations. Between the boolean clauses are 
statements that send lines to channels. Each channel 
must be explicitly defined in the configuration along 
with its characteristics. A channel can be associated 
with a file, in which case its contents become those of
the file. A channel can also be associated with an 
external command or script, in which case the script is 
used to process the channel’s contents. External com- 
mands are also used to perform syntax and semantic 
checks of channels’ content to ensure correctness. 


Processing is performed in several distinct 
phases in newfig: read, intrinsic definition, inference, 
macro definition, generation, filtering, instantiation. 
No changes are performed on the system until the 
instantiation step, giving newfig ample opportunity to 
discover problems before changes are made. If any 
problems are detected before instantiation, newfig will 
be fail-safe and not make any changes to the system. 


Decisions about a system are solely dependent 
on the boolean clauses in the configuration; the role of
the first few phases is to evaluate the clauses. First, the
entire configuration is read in and parsed. Then certain 
facts about the system are used to determine a set of 
intrinsically true symbols. The name of the system is 
one such symbol: it is always true. Additionally, the 
following symbols are intrinsically true: the name of 
the operating system (in all lower case), the name 
combined with the operating system release, and the 
platform type. Thus, a system named sammy running 
Solaris 9 on a SPARC platform will have the following
intrinsically true symbols: sammy, sunos,
sunos-5.9, sparc. Another intrinsically true symbol 
represents the network of the system’s IP address (or 
addresses). For example, a system with the address 
10.5.2.12 would have the symbol net.10.5.2 defined as 
intrinsically true. The configuration can also invoke 
external commands to augment the set of intrinsic 
symbols with either true or false values. 
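
As a rough illustration of the intrinsic definition phase, the following sketch derives the symbols listed above from basic facts about a host. The function name and its parameters are hypothetical, and the network symbol is assumed to consist of the first three octets of each address, which matches the 10.5.2.12 example but may not be how newfig actually determines the network.

```python
def intrinsic_symbols(hostname, os_name, os_release, platform, addresses):
    """Return the set of intrinsically true symbols for one host."""
    os_lower = os_name.lower()            # OS name in all lower case
    symbols = {
        hostname,
        os_lower,
        f"{os_lower}-{os_release}",       # name combined with release
        platform,
    }
    for addr in addresses:
        # Assumption: the network is the first three octets of the address.
        network = ".".join(addr.split(".")[:3])
        symbols.add(f"net.{network}")
    return symbols
```

For the host in the example, intrinsic_symbols("sammy", "SunOS", "5.9", "sparc", ["10.5.2.12"]) yields sammy, sunos, sunos-5.9, sparc, and net.10.5.2.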


Once the intrinsics have been determined, the 
values of the other symbols in the configuration are 
determined through inference. Not every symbol’s 
value can be determined. When inference is complete 
there will be a set of symbols known to be true, a set 
known to be false, and a set of symbols whose values 
are unknown. For all the remaining phases, the only 
symbols which matter are those known to be true. 
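
The inference phase can be pictured as a fixed-point computation over a three-valued logic (true, false, unknown). The sketch below is an assumption about how such inference might work, not a description of newfig's internals. Expressions are modeled as nested tuples, a symbol with several clauses is taken to be true if any clause is true and false only if all are false, and terminal symbols without an intrinsic value default to false as the paper states later. Under these assumptions, a symbol stays unknown only when its clauses depend on other unknown symbols, for example in a cycle.

```python
def symbols_in(expr):
    """Collect the symbol names used in an expression."""
    if isinstance(expr, str):
        return {expr}
    return set().union(*(symbols_in(arg) for arg in expr[1:]))


def evaluate(expr, values):
    """Three-valued evaluation: returns True, False, or None (unknown).
    expr is a symbol name or ("not", e), ("and", e1, e2),
    ("or", e1, e2), ("xor", e1, e2)."""
    if isinstance(expr, str):
        return values.get(expr)
    op, *args = expr
    vals = [evaluate(arg, values) for arg in args]
    if op == "not":
        return None if vals[0] is None else not vals[0]
    if op == "and":
        if any(v is False for v in vals):
            return False
        return None if any(v is None for v in vals) else True
    if op == "or":
        if any(v is True for v in vals):
            return True
        return None if any(v is None for v in vals) else False
    if op == "xor":
        return None if any(v is None for v in vals) else vals[0] != vals[1]
    raise ValueError(f"unknown operator {op!r}")


def infer(clauses, intrinsics):
    """clauses: list of (symbol, expr); intrinsics: dict symbol -> bool.
    Returns a dict holding every symbol whose value could be determined."""
    values = {"all": True, "true": True, "false": False}
    values.update(intrinsics)
    defined = {symbol for symbol, _ in clauses}
    for _, expr in clauses:
        for symbol in symbols_in(expr):
            if symbol not in defined and symbol not in values:
                values[symbol] = False    # terminal symbol, no intrinsic value
    changed = True
    while changed:
        changed = False
        for symbol in sorted(defined):
            if symbol in values:
                continue
            results = [evaluate(e, values) for s, e in clauses if s == symbol]
            if any(r is True for r in results):
                values[symbol] = True
            elif all(r is False for r in results):
                values[symbol] = False
            else:
                continue                  # still unknown in this pass
            changed = True
    return values
```

Symbols that never appear in the returned dictionary correspond to the "unknown" set described above.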


The configuration language allows for the defini- 
tion and expansion of macros in the statements and the 
channel declarations. In the macro definition phase, 
the values for all macros are determined. A special 
append operator (+=) is available to add to an existing 
macro definition, such as a PATH. 


The generation phase creates the content of 
every channel. This is done by processing the output 
statements associated with true symbols. 


Once the contents of each channel are known, syntax
checking is performed by an external program as
specified in the channel definition. Since newfig itself
has no knowledge of the intended content of a channel,
syntax checkers provide a way to verify that a
channel’s content is correct. Although this phase is
named syntax checking, any sort of check can be carried
out by an external program. Examples of checks
which could be performed are: ensure that each line of 
a channel has the correct number of fields, ensure a 
passwd channel has a line for root, verify that every 
program listed in inetd.conf exists. If any of the syntax 
programs indicates an error, then newfig will fail-safe 
and not make any changes to the system. 


A close variant of syntax checking is the filtering
phase. This phase is similar to syntax checking, except 
that the external programs invoked in this phase alter 
the content of the channel in addition to performing 
simple syntax checking. Each program acts as a Unix- 
style filter, reading the channel contents from standard 
input and writing to standard output. The results of the 
filter are used to replace the content of the channel. 
Although the order of records in most files is unimpor- 
tant, it may be desirable to sort the output to improve 
human readability. The passwd file is commonly 
sorted by UID, and a filter could easily perform this 
task for generated passwd files. Optimization is 
another good use of filtering. A generated hosts.allow 
file may contain overlapping address ranges, and a fil- 
ter could remove them as well as reformat the output 
to improve readability. It is important to the integrity 
of the entire system that filters perform no side effects 
and that their actions are restricted to producing output 
and error messages. 
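
A filter of the kind described here might look like the following sketch, which sorts passwd entries by UID. The function names are hypothetical, and the code assumes the usual colon-separated passwd layout with the UID in the third field. It reads from one stream, writes the revised content to another, reports problems on a third, and has no other side effects.

```python
import sys


def sort_passwd(entries):
    """Return passwd lines sorted by numeric UID (third field)."""
    return sorted(entries, key=lambda line: int(line.split(":")[2]))


def main(stdin=sys.stdin, stdout=sys.stdout, stderr=sys.stderr):
    """Filter: read channel contents, write the sorted contents.
    Returns the exit status (0 on success, 1 on error)."""
    entries, status = [], 0
    for number, raw in enumerate(stdin, start=1):
        line = raw.rstrip("\n")
        fields = line.split(":")
        if len(fields) < 3 or not fields[2].isdigit():
            print(f"{number}: cannot find a numeric UID", file=stderr)
            status = 1
        else:
            entries.append(line)
    if status == 0:
        for line in sort_passwd(entries):
            print(line, file=stdout)
    return status

# As a standalone filter script, the file would end with:
#     sys.exit(main())
```

Keeping the sort in its own function makes the filter easy to test without touching real standard input or output.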


The final phase is instantiation, which is performed
once all of the preceding phases have executed
without error. This phase ensures that the underlying 
system matches the configuration. For file channels, 
the file is compared to the generated content of the 
channel. If there is no difference, the file is left 
untouched (thus modification times are unchanged). If 
the channel contains different content, a new copy of 
the file is created and put into place. A file channel can 
also specify ownership and permission modes for the 
file. An action channel has its action command 
invoked with the channel contents available as standard 
input. Action commands usually perform side effects. 


Figure 1: The phases of newfig. [Diagram not reproduced; its labels read: Syntax & Filter Scripts, Channel Generation, Instantiate, Action Scripts.]


A channel can have both a file and an action associated
with it. In such cases, the action is only performed 
when the file is changed. Thus an action can be used to 
restart or “tickle” a daemon only after its configuration
file has been updated by newfig. Figure 1 shows the
basic flow of information through the various phases. 
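
The instantiation behaviour for a channel that has both a file and an action can be sketched as follows. This is a minimal illustration under stated assumptions, not newfig's code: the function name is hypothetical, the new copy is put into place with an atomic rename (the paper only says that a new copy is created and put into place), and ownership handling is omitted.

```python
import os
import subprocess
import tempfile


def instantiate_file(path, content, mode=0o644, action=None):
    """Make `path` contain exactly `content` (bytes).
    Returns True if the file had to be replaced, False otherwise."""
    try:
        with open(path, "rb") as f:
            unchanged = f.read() == content
    except FileNotFoundError:
        unchanged = False
    if unchanged:
        return False                 # left untouched, so mtime is preserved
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(content)
        os.chmod(tmp, mode)
        os.replace(tmp, path)        # assumption: swap in by atomic rename
    except BaseException:
        os.unlink(tmp)               # do not leave a stray temporary file
        raise
    if action is not None:
        # The action runs only when the file changed and receives the
        # channel contents on standard input.
        subprocess.run(action, input=content, check=True)
    return True
```

Calling the function twice with the same content changes nothing the second time and does not rerun the action, which is what allows a daemon to be restarted only after a real configuration change.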


Configuration Specifics 


Each configuration statement has two separate 
components: a logical relationship and a body of state- 
ments. A statement need not contain both components. 


Logical Clauses and Expressions 


The logical relationships are specified using a 
simple boolean algebra. The rules for symbol names 
are very generous, allowing numbers, letters, and many 
special characters. Expressions can contain and, or, 
inversion, and grouping. There are two ways to specify 
a logical relationship: a standard clause and implica- 
tion. A standard clause takes the following form: 


symbol: expression 


If the value of the expression can be determined, then 
it becomes the symbol’s value. An expression consists 
of any combination of the following elements: 


or      symbol1 symbol2
xor     symbol1 ^ symbol2
and     symbol1 & symbol2
not     !symbol1


Parentheses can be used to group expressions together, 
as in:
symbol: !(a b&c d)


Wherever a clause can be used, an expression may 
also be used. An expression is just a clause with no 
left hand side, as in: 


(Ca bie: dd 


The second relationship is implication, where a single 
symbol’s value is used to imply the value of a list of 
symbols (expressions do not make sense in this con- 
text). This takes the following form: 


symbol -> symbol1 symbol2 symbol3


This is equivalent to:


symbol1: symbol;
symbol2: symbol;
symbol3: symbol;


A clause may be optionally followed by a body. When 
present, this body must be contained in braces. The 
body consists of a mixture of any number of the fol- 
lowing: a statement that sends output to a channel, a 
macro definition, or an expression with its own body. 
A body is processed when the symbol or expression 
with which it is associated is known to be true. 


The symbols all and true are preassigned the
value of true, and the symbol false is preassigned the
value of false. Other symbol values are determined by 
intrinsic settings. Symbols which appear in the configuration
only on the right hand side of a clause (or only
on the left hand side of an implication) are called terminal
symbols. The value of a terminal symbol cannot
be determined by anything in the configuration. Ter- 
minal symbols which do not have an intrinsic value 
are assumed to be false. 


Macros 


Macros can be defined and expanded much like 
Makefile macros, although all macros are created
before any statement is processed (and thus before any
expansion takes place). Macros are created as follows:


MAC=value; 


There is a special append operator to allow definitions 
to be augmented: 


MAC+=value;


When expanded, such macros will have spaces sepa- 
rating each added component. Expansion is invoked 
with a dollar sign followed by the macro name in 
braces, for example: 


${MAC}


Some macros have special meaning. The PATH macro 
is used as the execution path for all commands 
invoked by newfig. HOSTNAME is preset to the name 
of the current host. Unlike other statements, macro 
definition can appear outside of a clause. 


Much like make(1), all macros are defined before 
any are expanded. Thus definition need not precede 
use in the configuration. The order in which macros 
are defined is only significant in two cases. First, if a 
macro is redefined then the last definition will be 
used. Second, the order in which macro append defini- 
tions are processed will determine the order in which 
the strings appear in the final macro. Currently there is 
no way to directly control the order in which defini- 
tions are processed other than basic sequential order. 
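
The macro rules above (define everything first, last plain definition wins, appends are joined with spaces in processing order) can be modeled with a short sketch. The function names and the tuple representation of definitions are hypothetical. Expansion of an undefined macro to an empty string is an assumption borrowed from make(1), and the handling of two dollar signs as a literal dollar sign follows the description given with channel statements.

```python
import re


def define_macros(definitions):
    """definitions: iterable of (name, op, value), where op is "=" or "+=".
    All definitions are processed, in order, before any expansion."""
    macros = {}
    for name, op, value in definitions:
        if op == "=":
            macros[name] = value                  # last definition wins
        elif op == "+=":
            if name in macros:
                macros[name] = f"{macros[name]} {value}"   # space separated
            else:
                macros[name] = value
        else:
            raise ValueError(f"unknown operator {op!r}")
    return macros


# "$$" is a literal dollar sign; "${NAME}" is a macro reference.
_TOKEN = re.compile(r"\$\$|\$\{([^}]+)\}")


def expand(text, macros):
    """Expand ${NAME} references in text using the finished macro table."""
    def replace(match):
        if match.group(0) == "$$":
            return "$"
        # Assumption: an undefined macro expands to the empty string.
        return macros.get(match.group(1), "")
    return _TOKEN.sub(replace, text)
```

Because the table is complete before expand is ever called, a macro may be used in the configuration before the line that defines it.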


Channel Statements 


Channel statements are very simple. The state- 
ment begins with a channel name, is followed by any 
number of fields, and ends with a semi-colon. When 
processed, all fields are written to the named channel, 
and separated by a single space. Here are some exam- 
ples of channel statements: 

resync /opt/proftpd; 

hosts.allow "ALL: 127.";

inetd.conf telnet stream tcp nowait 
root /usr/sbin/in.telnetd in.telnetd; 


The usable characters in an unquoted field are limited. 
Quoted strings are written exactly as they appear after 
macro expansion. A dollar sign can be represented
with two dollar signs. In the second example, the
macro MAC is not expanded:

"string ${MAC}"

"string $${MAC}"


A special form of channel statement is the %include 
statement. This allows the entire contents of an exist- 
ing file to be included in a channel. For example: 


%include passwd /etc/base.passwds;


Clause and Body


The entire clause with its body must end with a 
semi-colon. Thus any of the following are valid 
clauses: 


symbol: !(a b&c d)
{
    channel line;
};
symbol: e & f;
symbol1:
{
    channel line2;
};
symbol2
{
    channel line2;
};


A statement or a macro definition within a body may
be guarded with a boolean expression. Thus the fol- 
lowing two clauses do the same thing: 
x { y { channel a; }; };
x & y { channel a; }; 


The first form is more useful when the body has a 
number of statements, only one of which needs to be 
guarded, as in: 
x {
    ch1 a;
    ch2 b;
    y { ch3 c; };
};


In this form, ch1 and ch2 receive output whenever x is
true, but the ch3 statement is only processed when 
both x and y are true. This can be useful in situations 
where certain statements are needed only for particular 
operating systems. 


If a clause defines the value of a symbol, then 
any body associated with the clause is performed 
whenever that symbol is true, even if it became true 
via a different clause. Consider the following example: 

x: a b
{ output line1; };

x: c d
{ output line2; };


Note that both lines are sent to the output whenever
the symbol x is true. More specifically, if a is true 
while b, c, and d are false, then the symbol x is 
asserted to be true and both lines are sent to the output, 
even though the second expression (“c d”) is false.
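
The rule that every body attached to a symbol is processed once that symbol is true, along with guarded statements inside a body, can be illustrated with a small model of the generation phase. The data layout is hypothetical, and guards are simplified to single symbol names rather than full expressions.

```python
from collections import defaultdict


def generate(bodies, true_symbols):
    """bodies: list of (guard, statements). A statement is either a
    (channel, line) pair or a nested (guard, [statements]) pair.
    Returns a dict mapping each channel name to its list of lines."""
    channels = defaultdict(list)

    def run(guard, statements):
        if guard not in true_symbols:
            return                      # body skipped: guard is not true
        for first, second in statements:
            if isinstance(second, list):
                run(first, second)      # nested body with its own guard
            else:
                channels[first].append(second)

    for guard, statements in bodies:
        run(guard, statements)
    return dict(channels)
```

With two bodies attached to x, both lines reach the output channel as soon as x is known to be true, regardless of which clause made it true.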


Channel Definitions 


There are no predefined channels. All channels 
must be explicitly defined in the configuration. This 
definition is done with a %channel clause. Following 
the keyword %channel is the channel name and then,
in braces, its definition. For example:

%channel hosts {
    file /etc/hosts;
};


A channel may have a number of characteristics as
declared in the body of the channel definition. How- 
ever, a channel may only be defined once. 


A channel may take on several forms, sometimes 
in combination. A channel that is associated with a 
file, or “file channel,” ensures that the file contains
exactly what appears in the channel. During instantia- 
tion, the contents of the channel are compared to the 
file, and if they differ, the file is rewritten to exactly 
match the channel. If they are the same, the file is left 
untouched. Mode and ownership are also compared 
and corrected where necessary. 


An “action channel” has an action associated 
with it. The action is an external program that is run 
during the instantiation phase and uses the channel 
contents as standard input. It is expected that an action 
program will perform side effects. 


A channel may be both an action channel and a 
file channel. In these cases, the action is only performed 
when the file is changed. This form is useful for updat- 
ing daemon configuration files, as the action can ensure 
that the daemon is signaled or restarted. 


A directory may be associated with a channel to 
form a “directory channel.” Such a channel is 
expected to contain a list of files (one per line). The 
target directory is checked to ensure that it contains 
the listed files and nothing else. The contents of those 
files are checked and updated. Consider the following
series of statements where “xinet” is a directory channel
for /etc/xinetd.d:


action external action program 
target directory for directory channel 


directory 


file target file for file channel 
filter external program to check and augment channel contents 


syntax external program to check channel contents 

owner required file owner (for file and directory channels) 
group required file group (for file and directory channels) 
mode required permissions (for file and directory channels) 


singlesource 


channel contents may only come from one body 


after channel must be processed after another channel 





Figure 2: Channel properties. 
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xinet /opt/proftpd/xinetd/proftpd; 
xinet /opt/rsync/xinetd/rsync; 


Each of the named files will be copied in to the target 
directory, and newfig will ensure that no other files 
exist in that directory. 
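
The directory channel behavior can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name is hypothetical, the target directory is assumed to hold regular files only, and mode and ownership handling are left out.

```python
import filecmp
import os
import shutil


def instantiate_directory_channel(target_dir, source_paths):
    """Make target_dir contain copies of source_paths and nothing else.

    Hypothetical helper; returns True if the directory was modified.
    """
    # Files keep their own names in the target directory.
    wanted = {os.path.basename(src): src for src in source_paths}
    changed = False

    # Remove anything that the channel does not list.
    for name in os.listdir(target_dir):
        if name not in wanted:
            os.remove(os.path.join(target_dir, name))
            changed = True

    # Copy in files that are missing or whose contents differ.
    for name, src in wanted.items():
        dest = os.path.join(target_dir, name)
        if not os.path.exists(dest) or not filecmp.cmp(src, dest, shallow=False):
            shutil.copyfile(src, dest)
            changed = True
    return changed
```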


Channel Properties 


Figure 2 shows properties that can be set for each 
channel. 


External Programs 


The filter, syntax, and action properties invoke
external programs, either binaries or scripts. The
interface for all of these programs is the same. The
contents of the channel are readable on standard input.
A program that exits with a zero status code indicates
success; a non-zero exit code indicates that an error
occurred. Lines written to standard error are considered
to be error messages. If they are formatted correctly,
newfig can match each message with the line that caused
it. The error message format is a line number, followed
by a colon, then the message.
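
A syntax program that follows this interface might look like the Python sketch below. The line format it checks (a name followed by port/protocol) is an assumption chosen for illustration, and the names `check` and `main` are hypothetical; only the conventions for standard input, standard error, and the exit status come from the description above.

```python
import re
import sys

# Assumed line format for this illustration: "<name> <port>/<tcp|udp>".
ENTRY = re.compile(r"^\S+\s+\d+/(tcp|udp)\b")


def check(lines):
    """Return a list of (line_number, message) for malformed lines."""
    errors = []
    for number, line in enumerate(lines, start=1):
        text = line.strip()
        if not text or text.startswith("#"):
            continue  # blank lines and comments are accepted
        if not ENTRY.match(text):
            errors.append((number, "expected '<name> <port>/<tcp|udp>'"))
    return errors


def main():
    """Read the channel from standard input and report errors."""
    errors = check(sys.stdin.readlines())
    for number, message in errors:
        # Line number, a colon, then the message.
        print(f"{number}: {message}", file=sys.stderr)
    return 1 if errors else 0

# A real program would finish with: sys.exit(main())
```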


Filter programs must write a revised copy of the 
channel contents to standard output. These results will 
be taken as the new channel contents. Error messages 
generated by filters are treated the same as the ones 
generated by a syntax program. Note that, in both
cases, the contents of standard error are ignored unless
the program exits with a non-zero status code.
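
Seen from the calling side, the conventions for filter programs could be handled roughly as in the following Python sketch. The function name `run_external` and its return shape are hypothetical and are not newfig's internals; the sketch only mirrors the rules stated above: channel contents on standard input, revised contents on standard output, and standard error consulted only on a non-zero exit.

```python
import subprocess


def run_external(argv, contents):
    """Run a filter-style program; return (new_contents, errors).

    errors is a list of (line_number or None, message) tuples.
    """
    proc = subprocess.run(argv, input=contents, capture_output=True, text=True)
    if proc.returncode == 0:
        # Success: standard error is ignored, standard output replaces
        # the channel contents.
        return proc.stdout, []

    errors = []
    for line in proc.stderr.splitlines():
        number, sep, message = line.partition(":")
        if sep and number.strip().isdigit():
            errors.append((int(number), message.strip()))
        else:
            # Not in "N: message" form, so no line can be matched.
            errors.append((None, line.strip()))
    # On failure the original contents are returned unchanged.
    return contents, errors
```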


In order to preserve the idempotent characteristic 
of newfig, syntax and filter programs should not pro- 
duce any side effects. This includes modifying, creat- 
ing, or removing files, and signaling, stopping, or 
starting processes. These programs can, however, use 
temporary files provided they are removed when the 
program finishes. The action programs are expected to 
produce side effects: that is their purpose. Although
the channel contents are made available on standard
input, an action program need not read them.


Interesting Properties 


Newfig has a number of interesting properties 
that make it a strong utility. These features are gener- 
ally desirable in a tool that automatically adjusts a sys- 
tem’s configuration and behavior. 


Idempotency 


An operation that is idempotent is one that acts as 
if it was only invoked once even if it is invoked multi- 
ple times. An idempotent operation does not have a 
cumulative effect. Newfig is designed to be idempotent, 
but its ability to retain this characteristic is entirely 
dependent upon the action commands that it invokes. 
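
The difference between an idempotent and a non-idempotent action can be shown with a small Python sketch. Both functions are hypothetical examples, not newfig code; they model an action that adds an entry to a list of lines.

```python
def append_entry(lines, entry):
    """Not idempotent: every invocation adds another copy of the entry."""
    return lines + [entry]


def ensure_entry(lines, entry):
    """Idempotent: repeated invocations leave a single copy of the entry."""
    return lines if entry in lines else lines + [entry]
```

Running `ensure_entry` any number of times yields the same result as running it once, whereas each run of `append_entry` has a cumulative effect.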


The intrinsic symbol values for a system are
entirely dependent upon characteristics of the system
itself: its platform, operating system, and IP address.
As long as these remain constant, the intrinsic symbols
will always have the same value. However, it should
be noted that the system does provide an external
mechanism for augmenting the intrinsic set. If this
mechanism does not provide a constant set of results,
the idempotent behavior may be compromised.


As long as the intrinsic set remains constant
between invocations, the results of inference will
always be the same. This is ensured by the boolean
algebra which drives the inference phase. The set of
true and false symbols derived from inference is the
only means of selection for the remaining phases,
guaranteeing that their results will always be the same.
The only exception to this is the use of external
programs for syntax, filtering, and action. Both syntax
and filter commands are required to have no side
effects, thus they can easily be idempotent. This leaves
the action scripts. Newfig performs idempotent
operations if and only if the action scripts it invokes also
perform idempotent operations.


Transportability


The term transportability is being adopted to
describe a characteristic of newfig that is unique to an
automated configuration tool. A tool that is
transportable is one that is capable of creating the same
result regardless of the actual system on which it is
run. Transportability is essential for the testability of a
site’s configuration. In order to perform regression
tests on an infrastructure configuration, the testing
mechanism must be able to determine the results of
the configuration tool for a variety of systems
(perhaps all) in the infrastructure. Without transportability,
the generation of this data must take place on each
system in the test set.


Newfig provides transportability through its strict
use of boolean algebra to drive all the decisions.
Assume that both the newfig configuration files and the
external programs used by newfig are constant
(functionally equivalent) across the infrastructure. All that
newfig needs to be able to generate the output of each
channel for a given host is the intrinsic set and the list of
predefined macros for that host. Newfig provides
mechanisms for generating just this information and using it
instead of the local set. This provides transportability.


Extensibility


One of the problems with systems such as
CFengine is the limited range of capability. In fairness,
CFengine has a great deal of functionality already
built in, but its ability to extend that functionality is
limited. Newfig is designed to be extensible through
extensive use of external commands. These commands
may be anything that can run on the native system:
scripts, perl, python, pre-compiled binaries, etc. There
is very little capability actually built in to newfig. The
design philosophy is similar to that of the original
Unix: newfig is a tool which can easily be used as a
building block for other tools.


    samba {
        inetd.conf netbios-ssn stream tcp nowait root /opt/samba/bin/smbd smbd;
        inetd.conf netbios-ns dgram udp wait root /opt/samba/bin/nmbd nmbd -d2;
        resync /opt/samba;
    };
    alpha -> samba;

Figure 3: Sample Samba configuration.


Conformance 


The configuration for newfig is a complete 
description of the files which it controls. Rather than 
providing a series of steps to edit an existing file, the 
configuration provides enough information to recon- 
struct the entire file. This approach ensures that the act 
of removing data from a file’s description will get 
implemented correctly on the affected machines. 


Some automated configuration systems, such as 
CFengine, provide convergence toward an ideal. In 
some environments, especially those with loose control 
over root access, convergence is an appropriate tool for 
effective centralized management. However, installa- 
tions which primarily consist of servers and where 
configuration changes are typically co-ordinated by a 
single organization do not need the flexibility of con- 
vergence. It is simpler to provide a complete descrip- 
tion of a configuration and enforce conformance to it. 
This ensures that changes are implemented completely 
and with a single iteration. A complete configuration is 
also descriptive in what it does not contain, making the 
removal of unnecessary items simpler. 


Consider a system that is configured to include a 
samba [7] server. This configuration might look like 
the one in Figure 3. 


Such a configuration would ensure that any sys- 
tem for which the symbol samba is true would have 
the lines needed for samba in inetd.conf and the direc- 
tory /opt/samba synced up correctly with a central 
repository. The last line of the configuration ensures 
that when alpha is true samba is true also. Thus the 
system alpha would have the smbd and nmbd lines 
added to inetd.conf. Other parts of the configuration 
would contain the remaining lines that are expected to 
appear in inetd.conf. Now consider what happens 
when the association between alpha and samba is 
removed, such as would be the case if it was decided 
that alpha should no longer provide samba service. 
The next time newfig is invoked, it will rebuild all the 
channels, including the inetd.conf channel. But this 
time it will build it without the smbd and nmbd lines. 
Newfig will detect that the resulting channel is differ- 
ent from the file inetd.conf and will instantiate the 
new channel contents as inetd.conf. Finally, if the
channel definition for inetd.conf has an action
command, newfig will run the action, providing an
opportunity to send a signal to inetd. The removal of this
data happens as a natural consequence of the information
from the configuration itself.


Fail-safe Operation 


The newfig design defers any modifications to 
the system until all other processing is complete. The 
soundness of the configuration is checked during read- 
ing and inference. The integrity of each channel’s con- 
tent can be checked and augmented with syntax and 
filter commands. No changes are made to the system 
until the final phase. If any problems are found prior 
to instantiation, newfig can decide not to continue. 
This provides a fail-safe mechanism to ensure that an 
incorrect configuration is not applied to any systems. 


The original design goal of newfig was to pro- 
vide an entirely fail-safe design: any sort of errors 
would prevent newfig from instantiating any changes 
and executing any action. During the deployment of 
this 100% fail-safe model we discovered that this may 
not be a desirable design. 


One obvious function that newfig can supply is
driving the distribution of files from a central
repository (commonly called a gold server). For example, a
channel can contain the names of directories and files
which must be kept in sync with a central server, and
the action for that channel can provide the mechanism
which performs the syncing. Such a channel is an
obvious choice for keeping the newfig configuration itself
in sync. Unfortunately, if newfig is 100% fail-safe, then
any error in the configuration will completely disable
this mechanism, making it impossible to recover
without a mechanism outside of newfig itself.


Experience with a 100% fail-safe system has 
made it obvious that certain types of errors need not 
hinder the update of unrelated items. As an example, 
consider a configuration which maintains password 
files and hosts.allow files. A mistake in an entry for 
hosts.allow files could be something as simple as
forgetting the dot at the end of a network pattern (such as
“172.16.1” without the trailing dot). A properly
written syntax command will catch that mistake and
correctly flag it. However, a 100% fail-safe design will
also prevent the passwd file from being updated, even 
though it has nothing to do with hosts.allow. 


A better design would be to contain the failure, 
failing only what is affected. In the case of hosts.allow 
it would be contained to just its channel and none 
other. If the configuration states a dependency 
between channels, such as the after property, then
failure of a channel should also cause failure of any
channels that depend on it. Newfig can easily be adapted to
fit this more limited idea of fail-safe.
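
The containment rule can be expressed as a small fixed-point computation. The Python sketch below is one possible reading of it, not the newfig implementation; the function name and the shape of the `after` mapping (channel to the channels it must follow) are assumptions made for illustration.

```python
def failed_channels(initial_failures, after):
    """Return every channel that must be skipped.

    `after` maps a channel to the channels it must be processed after.
    A channel fails if it failed itself or if any channel it depends
    on has failed, directly or indirectly.
    """
    failed = set(initial_failures)
    changed = True
    while changed:
        changed = False
        for channel, deps in after.items():
            if channel not in failed and failed.intersection(deps):
                failed.add(channel)
                changed = True
    return failed
```

With this rule, an error in hosts.allow stops only that channel, while an error in a channel that others are declared to follow stops those channels as well.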


Declarative Language 


The configuration language is designed to be 
declarative. As a result, the order in which files are 
processed, and the order in which statements and 
clauses appear in the file do not matter. This allows 
the maximum amount of flexibility for organization of
the configuration information. Although many of the files
of concern do not depend on the ordering of lines, some
important files do have this requirement. In order to
accommodate this, certain guarantees are made about
the order in which output statements are processed,
thus affecting the order of the lines within the
channels. Clauses can be processed in any order, but
statements within the body of a clause will be processed in
the order they appear. Line ordering in an included file
will be preserved. Consider the following example:

    x { output a; };
    x { output b; };

There is no guarantee that the output will be ordered a,
b. Although generally this will be true, newfig makes
no guarantees and configurations should not rely on it.
However, in the following example it is guaranteed
that the lines will appear a followed by b, since both
statements appear in the same body:

    x {
        output a;
        output b;
    };


Examples of Practical Applications 


Examples of some common applications for 
newfig are as follows. 


hosts.allow 


Access to services controlled by tcp wrappers 
[11] is set with the use of hosts.allow. This file can be 
controlled by newfig, allowing for the creation of any 
arbitrary hierarchy to define allowable access. The 
channel definition would be:

    %channel hosts.allow {
        file /etc/hosts.allow;
        filter allow-filter;
        owner root;
        mode 444;
    };


Although it is optional, a filter for processing this 
channel provides improved results. The filter can opti- 
mize the channel contents to ensure that there are no 
extraneous specifications in the channel. Consider the 
following usage of this channel: 


    all: { hosts.allow "ALL: 10.2.2.5"; };
    beta: { hosts.allow "ALL: 10.2.2."; };


For host beta both lines will appear in hosts.allow. A 
channel filter would be able to detect that the first 
entry is extraneous and remove it. The filter can also
collapse multiple lines for the same daemon (or for
ALL) into a single line.
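
A filter of this kind might be sketched in Python as follows. The sketch is illustrative only and is not the filter used at the authors' site: it assumes simple "daemon: clients" lines with whitespace-separated numeric address patterns, ignores options and other hosts.allow syntax, and relies on the tcp wrappers rule that a pattern ending in a dot matches every address that begins with it. The function name and the sample addresses are hypothetical.

```python
def optimize(lines):
    """Collapse entries per daemon and drop patterns covered by a prefix."""
    clients = {}  # daemon -> list of patterns, in first-seen order
    for line in lines:
        daemon, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "daemon: clients" line; a real filter would flag it
        bucket = clients.setdefault(daemon.strip(), [])
        for pattern in rest.split():
            if pattern not in bucket:
                bucket.append(pattern)

    result = []
    for daemon, patterns in clients.items():
        # Patterns ending in "." act as address prefixes.
        prefixes = [p for p in patterns if p.endswith(".")]
        kept = [p for p in patterns
                if not any(p != q and p.startswith(q) for q in prefixes)]
        result.append(f"{daemon}: {' '.join(kept)}")
    return result
```

For example, an entry for the single host 192.168.5.7 is dropped when the same daemon also lists the network pattern 192.168.5., and two lines for sshd are merged into one.
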
Services 

Something as simple as /etc/services can easily 
be maintained by newfig. Its channel definition only
needs to provide the link between the channel and the
file. If desired, a syntax checking step can be written
and added to the channel definition:

    %channel services {
        file /etc/services;
        syntax services-syntax;
        owner root;
        mode 444;
    };

This channel can be used in the configuration to add
lines to services, as follows:

    rsyncd: {
        services "rsync 873/tcp";
        services "rsync 873/udp";
    };

Since newfig generates files from scratch, the entire 
contents of the file must be specified by the configura- 
tion. This means that there must be a baseline of data 
available to add to the services channel to ensure that 
all the standard entries are there. Although each entry 
could be listed separately, it would be easier to include 
the baseline from a separate file (shown here with a
broken line):

    all: {
        %include services
            /opt/newfig/base/services;
    };


cron 


Each crontab file needs to be controlled as a sep- 
arate channel. For most systems, only root and a few 
system crontabs need to be managed. Since it is 
unwise to edit a crontab file directly, this channel uses 
a proxy file instead. For the best results, the proxy 
should be persistent across invocations of newfig so 
that crontab itself is only invoked when there has actu- 
ally been a change. For ease of demonstration, it is 
assumed that the macro PROXYFILES contains the 
name of a directory that holds proxy files. 


This channel definition has to define different 
actions for each operating system it supports, as there 
is wide variation in the use of the crontab command. 
The channel also invokes a syntax checker to ensure 
the channel’s contents are correct. 

    %channel cron.root {
        file "${PROXYFILES}/crontabs/root";
        syntax "syntax-cron";
        linux { action "crontab -u root -"; };
        sunos { action "crontab"; };
    };

Typical usage of this channel would be:

    sunos {
        cron.root "10 3 * * 0 /usr/lib/newsyslog";
    };


sysctl 


Linux sysctl is used to configure various kernel 
and driver parameters at runtime. The desired settings 
are kept in the file sysctl.conf and the command sysctl 
is run to set the parameters. Newfig can easily be used 
to drive the generation of sysctl.conf and provides cen- 
tral control over these settings. If a system is config- 
ured to be a web server, then its sysctl.conf can contain 
the settings needed to provide maximum performance. 
If it is then changed to run a database server, newfig
would alter the contents of sysctl.conf as described by 
its configuration to use the appropriate settings. 


A sysctl channel for linux would probably look 
like this: 


    %channel sysctl {
        linux {
            file /etc/sysctl.conf;
            action /sbin/sysctl -p;
        };
    };

The linux conditional is not strictly necessary, but 
does allow greater flexibility in the use of the channel. 
On non-linux systems, the channel contents will be 
ignored rather than generating an error. 


Symbolic Links 


Although newfig does not provide any built-in 
mechanisms for managing symbolic links, adding the 
functionality is as easy as writing a script. All that is 
needed is an action script that reads pathname pairs from 
standard input and ensures that one is a symbolic link to 
the other. This is a simple script to write, and it can be 
made as elaborate as necessary. It would also be benefi- 
cial for the script to have a syntax checking option. 
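
Such an action script could be sketched in Python as shown below. This is an assumption-laden illustration, not the script used by the authors: the function names are hypothetical, and each input line is assumed to hold two paths, the link target first and the link name second.

```python
import os


def ensure_symlink(target, link_path):
    """Make link_path a symbolic link to target; return True if changed."""
    if os.path.islink(link_path):
        if os.readlink(link_path) == target:
            return False  # already correct, nothing to do
        os.remove(link_path)
    elif os.path.lexists(link_path):
        # Refuse to replace a regular file or directory.
        raise FileExistsError(f"{link_path} exists and is not a symbolic link")
    os.symlink(target, link_path)
    return True


def process(lines):
    """Handle 'target link' pairs, one per line; return the number changed."""
    changed = 0
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # a syntax checking mode would report this line
        if ensure_symlink(fields[0], fields[1]):
            changed += 1
    return changed
```

Because an existing correct link is left alone, running the script repeatedly on the same input has no further effect.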


An example channel definition for handling symbolic
links is as follows:

    %channel symlink {
        syntax "link-action -s";
        action "link-action";
    };

This channel would be used as follows:

    sunos {
        symlink /etc/inet/hosts /etc/hosts;
    };

Management of an application’s multiple versions can
be handled via symbolic links as follows:

    proftpd-1.2.9: {
        symlink /opt/proftpd-1.2.9 /opt/proftpd;
        resync /opt/proftpd-1.2.9;
    };

    proftpd-1.2.10rc3: {
        symlink /opt/proftpd-1.2.10rc3 /opt/proftpd;
        resync /opt/proftpd-1.2.10rc3;
    };


File Distribution 


Newfig does not have a built-in file distribution
mechanism, not even for its own configuration files.
The focus of this tool was on automatic configuration,
so it relies on other means to accomplish file
distribution. Newfig can easily be used to drive a file
distribution and synchronization mechanism. For the
sake of example, consider a script, named resync, that
takes a list of directories and files on its standard
input. Each entry is synced up with a central server
using rsync [10] (recursively for directories). The
following channel can be used to feed the data to resync:

    %channel resync {
        action resync;
    };

    all: { resync /opt/local; };
    samba: { resync /opt/samba; };


Inet Daemon


The typical inet daemon is controlled through the
file /etc/inetd.conf. However, some systems (such as
Linux) have a more sophisticated inetd that is
configured through a collection of files in a directory,
typically /etc/xinetd.d. An infrastructure with a mix of these
systems can still be controlled with newfig but the two
types of inet daemons must be configured separately.
This example uses a directory channel and a null
channel. A directory channel contains a list of files and
ensures that the target directory contains each of the
named files and only those files. The files can
originate anywhere but will bear the same names in the
target directory. Here are two definitions, one for sunos
(which uses the traditional inetd) and one for linux:

    %channel inetd {
        sunos {
            file /etc/inetd.conf;
            owner root;
            mode 644;
            syntax "inetd-syntax";
            action "pkill -1 -u root inetd";
        };
    };
    %channel xinetd {
        linux {
            directory /etc/xinetd.d;
            action "pkill -1 -u root xinetd";
        };
    };

Typical usage would look like this:

    rsyncd: {
        inetd rsync stream tcp nowait
            root /usr/sbin/in.tcpd
            /usr/bin/rsync --daemon;
        xinetd /opt/rsync/etc/xinetd/rsync;
    };

Note that the body of the rsyncd clause specifies data
for both inetd and xinetd. However, it does not need to
distinguish between different system types. That
distinction is made in the channel definitions, not their
usage. Thus any system which supports xinetd can be
added to the channel definition.


    ftp-standalone: {
        rc /opt/ftp/rc/ftpd;
    };
    ftp-inetd: {
        inetd "ftp stream tcp nowait root /opt/ftp/sbin/ftpd ftpd -a";
    };

Listing 1: ftp multiplexed between stand-alone and inetd service.


Differing FTP Usages 


Because newfig evaluates logical clauses declaratively
rather than procedurally, it is able to detect
conflicts between clauses. For example, the following two
clauses are conflicting and are flagged as an error:

    one: two;
    two: !one;

This feature can be used to protect a configuration
from implementing conflicting configurations. One
example of this is ftp: it can be used either from within
inetd or as a stand-alone daemon. Sometimes there are
good reasons to use both types of configurations
within the same infrastructure. Newfig can handle this
and can also guard against accidentally attempting to
implement both methods on the same system.


Assume that a configuration has channels to support
the configuration of inetd.conf and boot time
“rc” scripts in the style of System V. Listing 1 shows
how ftp can be defined to act as stand-alone in some 
cases and as an inetd service in others. 


Individual machines are configured to request 
either of these two symbols: 


    server1 -> ftp-standalone;
    server2 -> ftp-inetd;


To prevent a misconfiguration from setting both sym- 
bols for the same server, the following statement can 
be used: 


false: ftp-standalone & ftp-inetd; 


Since the symbol false is always false, this statement
will cause a logic conflict whenever both ftp-standalone
and ftp-inetd are true.
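
The effect of such a guard can be illustrated with a deliberately simplified model in Python. The sketch below is not newfig's inference engine: it assumes that a statement of the form "a -> b" makes b true whenever a is true, it handles no negation or general boolean expressions, and the function names and the representation of the guard as a group of mutually exclusive symbols are hypothetical.

```python
def infer(intrinsic, implications):
    """Close a set of true symbols under (cause, effect) implications."""
    true = set(intrinsic)
    changed = True
    while changed:
        changed = False
        for cause, effect in implications:
            if cause in true and effect not in true:
                true.add(effect)
                changed = True
    return true


def conflicts(true, exclusions):
    """Return the exclusion groups whose symbols are all true."""
    return [group for group in exclusions if set(group) <= true]
```

In this model, a host that requests only ftp-standalone produces no conflict, while a host that ends up with both ftp symbols is reported.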


Deployment at CNN 


We began roll-out of newfig into the CNN web
farm infrastructure earlier this year. The web farm 
consists of over 800 hosts running a mix of Solaris 
and Linux. Prior to the use of newfig, we had con- 
structed a system to control file distribution utilizing 
rsync. A central repository (gold server) holds all the 
files that need to be distributed for the various plat- 
forms we support. The resync script uses information 
about the host to determine a list of directories that 
must be kept in sync, then uses rsync to ensure that 
they are. This system, part of an effort called Unity, is 
used to distribute binaries and configuration files for 
key services in our infrastructure, including web ser- 
vice and ftp service as well as patches. 


The deployment of newfig was built upon the 
success of resync. The initial configuration for newfig 
completely replaced the functionality of the original 
resync, and its deployment was seamless. Once newfig
was in place, we targeted hosts.allow as our first file to 
control: it is relatively simple to support but requires 
significant fine-grained control. 
A few weeks of effort was spent collecting the
existing settings from across the infrastructure and
codifying them in the newfig configuration language.
Our initial goal was to automatically generate
hosts.allow files that were no more restrictive than the ones
already in place. In many cases the resulting files were
more generous. Testing of this configuration was
accomplished by generating the file as /etc/hosts.allow.x,
then comparing the results to the existing hosts.allow.
We eventually reached a point where the host access 
being removed could be summarized in a short list, 
and we decided that each of the items on the list was 
acceptable or even desirable. Then we changed the 
configuration to generate the actual hosts.allow file. 
Of the nearly 800 machines in the infrastructure, we 
only experienced access problems with one. 


The resulting configuration was approximately 
2400 lines spread out across 52 files, plus four sepa- 
rate files utilized in include statements. A filter script 
was written to organize and optimize the channel 
results as well as check for errors. We found many 
errors in the pre-existing hosts.allow files which had 
gone undetected, usually a class C network specifica- 
tion with no trailing dot. The next goal for hosts.allow 
is to analyze the current access and determine what is 
still needed and what should be removed. We expect to 
dramatically simplify the final configuration by ratio- 
nalizing host access across large groups of systems. 


The hosts.allow experiment was sufficient to 
prove the viability of the project. With its success we 
intend to take control over the hosts file, the startup 
scripts, crontab files, inetd.conf, boot time configura- 
tions (especially default routes), and perhaps passwd 
and shadow. The hosts file poses a unique challenge. 
Our hosts are divided into internal and external,
depending on whether or not they can be accessed 
directly by the outside world. All external hosts 
receive a common hosts file, but internal hosts use 
DNS. We have found it desirable to use minimal hosts 
files on the internal hosts: ones that only contain infor- 
mation on the host itself, and the NIS and NFS 
servers. We want newfig to generate this file from a 
list of hostnames, ensuring that the IP addresses are 
always correct and providing central control over the 
hosts that appear in the file. 


Future Work 


Our experiences with newfig are still very lim- 
ited. As the support staff becomes more accustomed to 
its use, we plan to extend its control to as many facets 
of our systems as is feasible. 


We would also like to control the passwd and shadow
files with newfig. Currently we
use NIS for portions of our infrastructure and nothing 
but local files for the remainder. Controlling access by 
system or by groups of systems is easy with NIS, but 
extremely difficult with static files. The tradeoffs with 
NIS are well known, including unsuitable security and 
a poor level of robustness. We could replace NIS with
a system based on LDAP, but in order to reduce single 
points of failure we do not want any of our externally 
facing servers dependent on a central service. Newfig 
would provide us with the centralized control that we 
need, but there are security issues that need to be con- 
sidered for the distribution of the shadow information. 


We want to control startup scripts with newfig,
but these pose a number of interesting problems. 
Startup scripts differ widely among systems, requiring 
either separate channels or a single channel loaded 
with extra information. Effective control of the startup 
scripts would also require automatic stopping and 
starting of the daemons those scripts control as scripts 
are added and removed. Our goal is to control all 
startup scripts, so that those scripts which are not 
needed simply will not be included in the configura- 
tion. This goal is complicated by operating system 
vendor patches that alter these scripts. 


We have found the use of “net” symbols (such as
“net.10.1.2”) to describe a system’s local network very
beneficial for many aspects of system configuration. 
For example, the symbols are the basis for determining 
if a system is “internal” or “external” (the latter are 
accessible by the outside world). However, the current 
implementation is very limited, and assumes that net- 
work divisions all fall on the traditional class C bound- 
ary [4]. The usefulness of these symbols could be 
enhanced by extending the notation to something more 
general, especially something that includes a netmask. 
But the simple boolean logic makes this more difficult. 
Further thought in this area would be beneficial. 
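
The class C assumption, and why a netmask complicates matters, can be illustrated with the Python sketch below. The symbol format follows the “net.10.1.2” example above; the function names are hypothetical, and the second function only shows how many class C style symbols a shorter prefix would cover, not a proposed notation.

```python
import ipaddress


def net_symbol(address):
    """Return the class C style symbol for an IPv4 address."""
    octets = str(ipaddress.IPv4Address(address)).split(".")
    return "net." + ".".join(octets[:3])


def net_symbols_for(network):
    """List the class C style symbols covered by a network of /24 or shorter."""
    net = ipaddress.ip_network(network)
    if net.prefixlen > 24:
        raise ValueError("a prefix longer than /24 has no class C style symbol")
    return [net_symbol(str(sub.network_address))
            for sub in net.subnets(new_prefix=24)]
```

A /22 network, for instance, corresponds to four separate symbols under the current scheme, so a rule about that network has to name all four.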


Newfig provides a natural way to implement con- 
formance for data in files. When a configuration 
change requires lines to be removed from files which 
are controlled by newfig the changes happen automati- 
cally as a result of file construction. However, this 
characteristic does not extend to the side effects
implemented by action scripts, such as the earlier
Symbolic Links and File Distribution examples.
In these cases, newfig is constructing the input to a
process which generates side effects. Consider the case
of symbolic links with the following example:

    one: alpha {
        symlink /opt/etc/one /etc/one;
    };


The first time newfig runs, host alpha will have 
the symbolic link /etc/one created. If alpha is removed 
from the definition of the symbol one then the output 
line will no longer appear in the symlink channel. 
However, the symbolic link will remain as there is 
nothing that will remove it. We have an idea on how 
this problem can be overcome and will be pursuing its 
implementation. 


There are times when it is advantageous to impose 
an ordering on output statements for a particular chan- 
nel. Rather than use syntactic rules to imply an ordering, 
we envision a way to explicitly specify interrelationships 
between the output lines. One possibility is to allow
for the specification of priorities on output statements, 
with a default of 100, and ensuring that lines appear in 
priority order. Thus a line that must always appear 
first could be given a priority of 1, and a line at the 
end a priority of 200. 
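
The priority idea could work along the lines of the following Python sketch. Since the feature is only envisioned, everything here is hypothetical, including the representation of an output statement as a (line, priority) pair in which a missing priority is written as None.

```python
DEFAULT_PRIORITY = 100


def order_output(statements):
    """Order (line, priority) pairs by priority; None means the default.

    sorted() is stable, so lines with equal priority keep their
    original relative order.
    """
    def key(statement):
        _, priority = statement
        return DEFAULT_PRIORITY if priority is None else priority

    return [line for line, _ in sorted(statements, key=key)]
```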


The manipulation of macros in the configuration 
files is well motivated but poorly implemented. They 
are not as intuitive as one would hope. Further work 
needs to be done on this concept to provide the neces- 
sary functionality in a way that is still easy to follow. 


Software Availability 


The software was developed internally at CNN. 
Its availability for public use and review has not yet 
been determined. 
ABSTRACT


Efficiently installing and configuring large sets of computer systems is an important concern 
for system and cluster administrators. Current solutions usually follow one of the two approaches: 
an image-based install or a metadata-based custom install. Both approaches limit the opportunities 
for optimizing the installation time by coupling the system specification with the installation 
technique and ignoring the relationships between configurations over time (as they evolve with 
patches and new packages). 


The Adaptable Installation System (AIS) is a new model and implementation that attempts to 
address these shortcomings by taking a hybrid approach to client system installation. As in the 
metadata-based approach, it uses descriptors to express what the final system should look like in 
terms of composition and configuration. At the same time, it uses imaging for part of the client re- 
installation to achieve speed. In this paper we present the design and implementation of AIS along 
with details on the algorithm that builds images and performance results of running the prototype
system on a set of RedHat based machines.


Introduction and Motivation 


Efficiently installing and configuring large sets 
of computer systems is an important concern for sys- 
tem and cluster administrators. Numerous programs 
that facilitate automated and unattended installations 
have been created. Generally they follow one of the 
two approaches: an image-based install or a metadata- 
based custom install. Both techniques are widely used 
and effective in certain scenarios. However, they both 
have two main disadvantages: 





1. They couple the specification of the system
   (golden-server, Kickstart or other configuration
   file) with the installation technique (over-network
   disk-image writing, package or OS install
   tool). This limits opportunities for algorithmic
   optimization and prevents using the best
   specification with the best technique in all cases.

2. They both assume a static model of system
   installation, meaning they assume installation is a
   singular, fairly heavy-weight event. Although
   some adaptation to more dynamic environments
   such as Clusters-on-Demand [18] is possible,
   they do not capture the relationship between
   system configurations over time.


                  Installation Factors
    Speed of       Number of        Available Server   Optimal Installation
    Installation   Configurations   Storage            Approach
1   Not a factor   Small            Not a factor       Custom, Image
2   Not a factor   Small            Factor             Custom
3   Not a factor   Large            Not a factor       Custom
4   Not a factor   Large            Factor             Custom
5   Factor         Small            Not a factor       Image
6   Factor         Small            Factor             AIS
7   Factor         Large            Not a factor       AIS
8   Factor         Large            Factor             AIS

Table 1: Factors that affect installation scenarios.


2004 LISA XVIII, November 14-19, 2004, Atlanta, GA

AIS: A Fast, Disk Space Efficient “Adaptable Installation System” ...


Table 1 compares the applicability of the two 
approaches with respect to three typical considerations 
in large installation/cluster management. The right-most 
column indicates the preferred installation approach 
given the values of the three considerations that are 
shown in the columns to the left. When the speed of 
client re-installation is not critical, custom install is a 
better option because of its ability to scale to a large 
number of system configurations and modest storage 
requirements. A new system configuration can be sup- 
ported by creating a corresponding metadata descriptor, 
e.g., a Kickstart file. Imaging remains the preferred
approach when the speed of client re-installation is critical.
However, as the number of configurations increases
dramatically, the set of images becomes time-consuming to manage
and requires large amounts of storage on the server.


Both imaging and custom installation approaches
fall short of simultaneously supporting fast client
re-installations, a large number of system configurations,
and moderate storage usage. The Adaptable Installation
System (AIS) is a new model and implementation
that attempts to address these shortcomings by taking 
a hybrid approach to client system installation. As in 
the metadata-based approach, it uses descriptors to 
express what the final system should look like in terms 
of composition and configuration. At the same time, it 
uses imaging for part of the client re-installation to 
achieve speed. 


The need to support a large number of system configurations while
simultaneously allowing fast client re-installations is motivated by
the development of “utility” computing and “clustering on demand”
services such as the Oceano [4, 10] and Cluster-On-Demand [18]
projects. These projects’ business model is that an organization
manages computer clusters on behalf of its clients (whether internal
or external) and guarantees an agreed-upon level of performance. It
maintains a large pool of client machines and dynamically reallocates
them from one virtual cluster to another as necessary. A second
motivation is the continued need to patch or upgrade, and occasionally
downgrade, production servers. This cycle causes a continually
increasing set of configurations that differ only by a few package
versions. To maintain them as online images would require continually
growing storage and management complexity.


System Configuration
A description of the composition of computer system software in terms
of the packages installed on this system. In the case of an RPM-based
distribution of Linux, a list of all RPMs installed on this system.

System Configuration Descriptor (SCD)
An XML file that describes system configuration at a class level.
Configuration is described in terms of Installation Base and
additional packages that make up the system.

Installation Base
A minimal, working system configuration. Typically consists of a
kernel, C libraries, and a small set of common Unix programs.

Host Configuration
System Configuration, plus any modifications to operating system
configuration files needed to achieve the desired host state. Host
configuration can be achieved manually, by copying host-specific files
from the AIS server, or by running tools such as Cfengine [6].

Host Configuration Descriptor (HCD)
An XML file that specifies host configuration details. It references
the SCD on which this host configuration is based.

AIS Server
The machine executing the AIS server code that calculates the contents
of each image and builds the cached image files. This machine also
hosts the image cache files and package files required for client
installation.


Table 2: Definitions of system components.


Figure 1: Stages in installation process. (Diagram not reproduced;
surviving labels: Phase 3: Disk imaging; Phase 4: Post-image
configuration; Image file; Client Node.)


<ImageConfiguration>
  <InstallBase>fc1</InstallBase>
  <OsBaseAlterations/>
  <Installations>
    <package type="rpm">mozilla-1.4.1-17</package>
    <package type="rpm">other package</package>
  </Installations>
</ImageConfiguration>


Figure 2: Sample system configuration descriptor. 


<HostConfiguration> 
<ScdId>Configuration 1</ScdId> 
<ConfigFiles> 
<file perms="644">/etc/hosts</file>
<file perms="644">/other/file</file>
</ConfigFiles> 
</HostConfiguration> 


Figure 3: Sample host configuration file. 


AIS Operation 


To better understand how AIS operates, it is
helpful to know the phases involved in client installation.
They are shown in Figure 1. The terms
used in this section are defined in Table 2.


As seen in Figure 1, AIS operates on two fronts: the installation
server and the client machine being installed. On the AIS Server, AIS
maintains a repository of System and Host Configuration Descriptors
and a repository of installation packages. With these, AIS can
reconstruct on disk any system configuration using its descriptors.
This is shown as Phase 1 of the overall process in Figure 1. Note that
AIS does not store or recover application data files except for host
configuration contained in the HCD. The data files can be maintained
on network file servers or separate data disks. One such method is
described in [17]. Furthermore, AIS creates and maintains a repository
of images (Phase 2), which it uses as the first step in reinstalling a
client machine. This step is depicted as Phase 3 of the overall
process.


The key to AIS operation is that it does not nec- 
essarily store images for all system configurations. 
Furthermore, an image does not have to correspond to 
any specific configuration. Instead, the content of the 
image cache is determined by an algorithm and the 
same image can be used for initiating a re-installation 
of more than one system configuration. The algorithm 
aims to minimize the client installation time. In the 
decision making process it considers all system con- 
figurations that might need to be installed and the disk 
space available for storing image files. Since the
image does not necessarily correspond to an exact system
configuration, AIS performs the necessary post-imaging
steps to arrive at the exact system configuration.
This step is depicted as Phase 4 in Figure 1.


On the client, AIS is a script that runs as soon as
the machine boots. The script uses a third-party imaging
tool, Frisbee, to retrieve the image from the AIS
Server and write it to disk. After this, the script performs
additional package installations, as necessary, to
achieve the desired system configuration. Finally, it
performs any host-specific configuration.


The following sections provide additional details 
about AIS operations on the Installation Server and 
the client machines. 


AIS Operations on the AIS Server 
Overview 


The content of the image cache depends on all the System and Host
Configuration Descriptors AIS is managing. The objective is to
maintain a combination of images that minimizes the time it takes to
install the next client machine, while meeting the disk space
constraint. In the context of a large number of system configurations
it is not feasible to store an image for every configuration. Thus,
the images that are maintained do not necessarily correspond to a
particular system configuration. Instead, an image can be a mixture of
common components from various configurations. This explains why in
the post-image configuration phase AIS may need to install additional
packages: it has to arrive at the desired system configuration from a
generalized image configuration.


Figure 4: Adding a new system configuration descriptor requires image cache to be updated.


System and Host Configuration Descriptors 


For AIS to manage a system configuration, a cor- 
responding System Configuration Descriptor (SCD) 
must be created. A sample SCD is shown in Figure 2. 
An SCD specifies the configuration of a system only in
terms of the installed packages. Each SCD references
an InstallBase, which is a minimal working system. The
packages that make up the Installation Base plus the
packages listed in the Installations section of an SCD
constitute all the packages for a given system configuration.
A system configuration specified by an SCD represents
a class of systems. Any host-specific configuration
is contained in a Host Configuration Descriptor
(HCD). A sample HCD is shown in Figure 3.
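
To make the descriptor layout concrete, the following is a minimal Python sketch of how an SCD and an HCD could be read. It is not the AIS implementation: the element names follow Figures 2 and 3, while the function names, the `fc1` base identifier, and the contents of the installation base (`kernel`, `glibc`, `bash`) are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Sample descriptors modeled on Figures 2 and 3.
SCD_XML = """
<ImageConfiguration>
  <InstallBase>fc1</InstallBase>
  <OsBaseAlterations/>
  <Installations>
    <package type="rpm">mozilla-1.4.1-17</package>
  </Installations>
</ImageConfiguration>
"""

HCD_XML = """
<HostConfiguration>
  <ScdId>Configuration 1</ScdId>
  <ConfigFiles>
    <file perms="644">/etc/hosts</file>
  </ConfigFiles>
</HostConfiguration>
"""


def parse_scd(text):
    """Return (installation base name, list of additional packages)."""
    root = ET.fromstring(text.strip())
    base = root.findtext("InstallBase").strip()
    packages = [p.text.strip() for p in root.iter("package")]
    return base, packages


def parse_hcd(text):
    """Return (referenced SCD id, list of (path, permissions) pairs)."""
    root = ET.fromstring(text.strip())
    scd_id = root.findtext("ScdId").strip()
    files = [(f.text.strip(), f.get("perms")) for f in root.iter("file")]
    return scd_id, files


def full_package_set(scd_text, base_packages):
    """All packages of a configuration: installation base plus additions."""
    base, extra = parse_scd(scd_text)
    return set(base_packages[base]) | set(extra)


# Hypothetical content of the installation base, for illustration only.
BASES = {"fc1": ["kernel", "glibc", "bash"]}
print(sorted(full_package_set(SCD_XML, BASES)))
print(parse_hcd(HCD_XML))
```

The union in `full_package_set` mirrors the statement above that the base packages plus the packages in the Installations section make up the whole configuration.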


Introducing New System Configuration 


Figure 4 shows the steps that are taken by AIS when a new System
Configuration Descriptor is introduced. The cache content may no
longer be optimal, because the just-added SCD was not considered when
the content was originally calculated. AIS therefore reruns the
algorithm to determine the new content of the image cache. After the
new content of the cache is determined, AIS needs to create the
images. This is done by first installing the system configuration in a
designated partition on the AIS Server and then running an image
creation program. These last two steps are repeated for every image
file.


The process of refreshing the image cache takes a considerable amount
of time. After AIS determines how many configurations should be in the
cache and what their content should be, each of those systems needs to
be installed on a dedicated hard disk on the AIS Server in order to
create an image. As such, refreshing the image cache is meant to be an
off-line operation, scheduled for times when it can complete before
the server starts to serve client re-installation requests.


AIS Operations on the Client Machine 
Image and Host Customization 


On the client, AIS is a script that runs after the 
client node is booted. The steps involved in the client 
installation are shown in Figure 5. 


The script contacts the AIS Server via HTTP with the client’s MAC
address. This allows AIS to uniquely identify the client and determine
which system configuration should be installed and which image file
should be used. AIS then starts a Frisbee server that serves the image
file and sends the client the command that it should execute to
retrieve the image. After the image is written to disk, the client
contacts the AIS Server to retrieve the list of any additional
packages that might need to be installed to achieve the desired system
configuration. After any additional packages are installed, the client
script performs host-specific configuration by copying OS
configuration files from the AIS Server.
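
The order of the client-side steps can be sketched as follows. This is only an illustration of the sequence described above, not the actual client script: the HTTP exchange and the Frisbee invocation are left out because the text does not give their URLs or command lines, and all names, the MAC address, and the lookup tables are hypothetical.

```python
def client_install_plan(mac, host_to_config, config_to_image,
                        extra_packages, host_files):
    """Return the ordered actions for one client, identified by MAC address."""
    # The server identifies the client and its target configuration by MAC.
    config = host_to_config[mac]
    # Imaging phase: write the cached image chosen for this configuration.
    steps = [("write_image", config_to_image[config])]
    # Post-image phase: install whatever the image is missing.
    steps += [("install_package", p) for p in extra_packages.get(config, [])]
    # Host phase: copy host-specific OS configuration files.
    steps += [("copy_config_file", f) for f in host_files.get(mac, [])]
    return steps


# Hypothetical server-side data, for illustration only.
plan = client_install_plan(
    "00:11:22:33:44:55",
    host_to_config={"00:11:22:33:44:55": "conf4"},
    config_to_image={"conf4": "amalgam.img"},
    extra_packages={"conf4": ["mozilla-1.4.1-17"]},
    host_files={"00:11:22:33:44:55": ["/etc/hosts"]},
)
for step in plan:
    print(step)
```

Keeping the plan as data separates the decision (made on the server) from its execution (done by the client script), which matches the division of work described in this section.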


Algorithm Details 


We will present a specific algorithm for con- 
structing a set of good system images to cache. Many 
other algorithms also exist and we are continuing to 
investigate them. Each potential algorithm has a dif- 
ferent set of tradeoffs. The one we present here is a 
fairly simple merge-based algorithm that maintains the 
invariant that all proposed images and targets are com- 
plete sets of packages with no missing dependencies. 
This algorithm produces results that are noticeably better than pure
imaging or pure metadata-based approaches; however, they are
theoretically non-optimal (in the sense of minimizing the time to
install any potential requested configuration). The results of
experiments conducted using the merge-based algorithm are discussed
later.


Figure 5: Client installation process. (Diagram not reproduced;
surviving labels: Installation Server; Start Frisbee daemon to serve
client’s image.)


Partitioning Available Space Among Installation Bases


The algorithm operates on two levels. First, it determines how much
disk space, out of all that is available for image caching, to
allocate to each installation base. Since images are not compatible
across installation bases, it is best that at least one image per
installation base is available. The amount of space allocated to each
installation base is proportional to the installation size of all
system configurations that rely on this base. For example, if the
installation size of systems that rely on a given base is thirty
percent of the total installation size of all systems AIS manages,
then thirty percent of the available disk space will be allocated for
caching images of this base.
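
The proportional split can be written down in a few lines. The sketch below is an illustration under assumed inputs: the function name, the configuration names, and the sizes (in MB) are hypothetical.

```python
def partition_space(total_space, config_sizes, config_base):
    """Split total_space among installation bases, proportionally to the
    summed installation size of the configurations that rely on each base."""
    size_per_base = {}
    for config, size in config_sizes.items():
        base = config_base[config]
        size_per_base[base] = size_per_base.get(base, 0) + size
    grand_total = sum(size_per_base.values())
    return {base: total_space * size / grand_total
            for base, size in size_per_base.items()}


# Hypothetical sizes in MB: configurations on base "fc1" account for
# 3000 of 10000 MB, i.e., thirty percent of the total installation size.
sizes = {"conf1": 1500, "conf2": 1500, "conf3": 7000}
bases = {"conf1": "fc1", "conf2": "fc1", "conf3": "other-base"}
print(partition_space(10000, sizes, bases))
```

With these inputs the base `fc1` receives thirty percent of the cache space, as in the example in the text.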


Determining the Composition of Cached Amalgam Images


Merge-based Algorithm 


The second step of the algorithm determines the 
makeup and number of images to create within the 
space allocated for each installation base. Ideally, there 
would be enough space to store an image for every sys- 
tem configuration. However, in the context of a large 
number of configurations, this may not be feasible. 


This step repeatedly merges two selected configurations into one
until the remaining configurations fit in the allotted space. The two
configurations selected for each merge are the pair whose shared set
of packages has the largest installation size. The configuration that
results from the merge is that common subset of packages.
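
The merge step can be sketched as below. This is a simplified illustration rather than the AIS code: it uses installation size as a stand-in for the size of the compressed image file, it does not check the dependency-completeness invariant mentioned earlier, and the package names and sizes are hypothetical.

```python
from itertools import combinations


def installed_size(packages, pkg_size):
    """Total installation size of a set of packages."""
    return sum(pkg_size[p] for p in packages)


def merge_until_fit(configs, pkg_size, space_limit):
    """Merge configurations pairwise until the candidate images fit.

    configs maps a configuration name to its set of packages. The result
    is a list of package sets, one per image to be cached.
    """
    # Start with one candidate image per distinct configuration.
    images = list(dict.fromkeys(frozenset(p) for p in configs.values()))
    while (len(images) > 1 and
           sum(installed_size(i, pkg_size) for i in images) > space_limit):
        # Pick the pair whose shared packages have the largest size.
        a, b = max(combinations(images, 2),
                   key=lambda pair: installed_size(pair[0] & pair[1], pkg_size))
        merged = a & b  # the merged image is the common subset
        images = [img for img in images if img not in (a, b)]
        if merged not in images:
            images.append(merged)
    return images


# Hypothetical package sizes (MB) and configurations.
PKG_SIZE = {"base": 100, "a": 50, "b": 40, "c": 10}
CONFIGS = {
    "conf1": {"base", "a"},
    "conf2": {"base", "a", "b"},
    "conf3": {"base", "c"},
}
for image in merge_until_fit(CONFIGS, PKG_SIZE, space_limit=300):
    print(sorted(image))
```

In this example the three configurations need 450 MB in total; merging conf1 and conf2 into their common subset brings the total to 260 MB, so the loop stops after one merge.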


Once this decision step completes, it produces
three pieces of output:
1. A mapping of each system configuration to the
image file that will be used in the imaging
phase of client installation;

2. An SCD that describes the content of each of
the images; and

3. A file that lists any additional packages that need
to be installed for each system configuration.


If a particular system configuration was merged 
in, then there will be no image that represents it 
exactly. In this case the file will list the missing pack- 
ages. For those systems that were not merged-in, the 
file will not list any packages. 
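
Given the set of images chosen by the decision step, the mapping and the list of additional packages follow directly. The sketch below is an assumption-laden illustration: the image names, the rule of preferring the largest usable image, and all sample data are hypothetical.

```python
def plan_outputs(configs, images, pkg_size):
    """Derive the configuration-to-image mapping and the extra packages.

    configs: configuration name -> set of packages
    images:  image name -> set of packages contained in that image
    """
    mapping, extras = {}, {}
    for name, packages in configs.items():
        packages = frozenset(packages)
        # An image is usable only if it contains nothing the target lacks.
        usable = [img for img, content in images.items()
                  if frozenset(content) <= packages]
        # Prefer the usable image that already covers the most content.
        best = max(usable,
                   key=lambda img: sum(pkg_size[p] for p in images[img]))
        mapping[name] = best
        extras[name] = sorted(packages - frozenset(images[best]))
    return mapping, extras


# Hypothetical data: conf1 and conf2 were merged into one amalgam image.
PKG_SIZE = {"base": 100, "a": 50, "b": 40, "c": 10}
CONFIGS = {
    "conf1": {"base", "a"},
    "conf2": {"base", "a", "b"},
    "conf3": {"base", "c"},
}
IMAGES = {"img_1_2": {"base", "a"}, "img_3": {"base", "c"}}

mapping, extras = plan_outputs(CONFIGS, IMAGES, PKG_SIZE)
print(mapping)
print(extras)
```

A configuration whose exact image is cached gets an empty list of additional packages, while a merged-in configuration gets the packages missing from its amalgam image.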


Experimental Results 


AIS is currently implemented in Python and works with RPM-based Linux
distributions. The existing Frisbee tools are used to generate
compressed disk images and reliably multicast them to the client
machines. The image times reported use Frisbee directly. Tables 4, 5,
and 6 show the results of running AIS against ten system
configurations. While this is not the large number of configurations
AIS is intended to manage, the numbers provide strong evidence of the
advantage AIS provides over other approaches. These tests were
conducted on a set of x86 PCs with Pentium 3 processors and a 100 Mbps
Ethernet network. Although these are not the newest generation of PCs,
we believe they are representative of actually deployed hardware.


The configurations used in this example are all based on the Fedora
Core 1 Linux distribution. They vary in size and composition. The
second column of Table 3 describes the content of each configuration
in terms of how to achieve it using the redhat-config-packages
program. Following the default comps.xml file in RedHat/Fedora,
redhat-config-packages divides all packages into five global groups:
Desktops, Applications, Servers, Development, and System. Each package
group is further subdivided into subgroups. Each package can be a
mandatory, default, or optional member of a given subgroup. In Table 3
the package group name is capitalized and followed by a plus sign only
if every package of every subgroup is installed. The package group
name is not capitalized and is followed by a minus sign if only
mandatory and default packages of every subgroup are installed. Thus,
Conf 5, for example, consists of only mandatory and default packages
belonging to the subgroups in the Applications, Servers, and
Development package groups.


Configuration   Description                                          Image Used
Conf 7          [applications-], [servers-], [development-], [kde]   {7}
Conf 8          ... [Servers+], [Development+], ...
Conf 9          ... [Development+], ...

Table 3: Configuration to image file mapping. (Only the row fragments
shown above are recoverable.)


Table 3 shows that after this particular run of AIS, due to the disk
space constraints provided, the image cache consists of only six image
files, as indicated by the six unique entries in the rightmost column.
For every system configuration in the leftmost column of Table 3, the
image file that will be used during client re-installation is shown in
the corresponding row of the rightmost column. For configurations 9,
8, 6, 4, and 2 there is no image in the cache that represents their
exact configurations. The amalgam image, depicted as 9, 8, 6, 4, 2,
will be used in the imaging phase of client installation for these
configurations. For the other system configurations, there is an image
file that represents that configuration exactly.


Table 4 shows the time it took to install each of the ten
configurations using the three approaches. RedHat Kickstart was used
for custom installation, and Frisbee was used to derive the timing
results for the imaging approach. For those configurations that had a
corresponding image in the cache under AIS, the time is nearly
identical to Imaging. The one-second difference is roughly how long it
took to complete the host configuration, as can be seen in Table 6.
Those configurations that used the amalgam image took slightly longer
than Imaging because of the additional packages that needed to be
installed and the host configuration. However, the Imaging approach
required almost twice the disk space for storing image files as AIS,
as seen in Table 5.


To further demonstrate interesting implications of the AIS approach
for maintaining an infrastructure of system configurations, we compare
the behavior of Imaging and AIS under evolving configurations in
Figure 6. This situation can occur when the administrator wants to
keep all versions of the same configuration as it evolves (patches
applied, new packages installed) throughout its lifetime. In a typical
imaging approach, an image for every version of the configuration must
be maintained, consuming much more storage than the size of the
changes.


In Figure 6 AIS maintains only one image, from which it can achieve
any version of the configuration. As new versions are introduced, they
may take incrementally longer to install, because they deviate more
from the imaged configuration, as shown in Scenario 1. Alternatively,
AIS can update the image so that installation of the most recent
versions takes the least time, as shown in Scenario 2.
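
One way to picture the post-imaging work for an evolving configuration is as a difference between the package versions in the image and those in the requested version. The sketch below is illustrative only: the function name, the action labels, and the later mozilla version strings are hypothetical, and the text does not state how AIS represents these steps internally.

```python
def post_image_actions(image, target):
    """List the actions needed to turn the imaged package set into the
    target one. Both arguments map package name -> version string."""
    actions = []
    for name in sorted(target):
        if name not in image:
            actions.append(("install", name, target[name]))
        elif image[name] != target[name]:
            # Covers upgrades and downgrades alike.
            actions.append(("replace", name, target[name]))
    for name in sorted(set(image) - set(target)):
        actions.append(("remove", name, image[name]))
    return actions


# One package updated twice over the lifetime of the configuration.
v1 = {"mozilla": "1.4.1-17"}
v3 = {"mozilla": "1.4.3-1"}   # hypothetical later version

# Image built from version 1: reaching version 3 takes a single step,
# without passing through the intermediate version.
print(post_image_actions(v1, v3))
# Image rebuilt from version 3: an older version is reached in reverse.
print(post_image_actions(v3, v1))
```

Because the difference is computed directly between the image and the target, repeated updates of the same package do not add steps, which is consistent with the note on Scenario 1 in Figure 6.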


Table 4: Timing of custom install, pure imaging and AIS approaches.
(Columns: Conf 1 through Conf 10; the timing values are not
recoverable.)


             Installation Pkg.   Image
             Repository          Repository
Kickstart    1.8 GB
Imaging      0 or 1.8 GB         9.7 GB
AIS          1.8 GB              5.1 GB

Table 5: Storage requirements for three approaches.


Configuration   Additional Installations   Host Configuration
Conf 1          0:00:00                    (illegible)
Conf 2          0:00:00                    0:00:01
Conf 3          0:00:00                    0:00:01
Conf 4          0:02:34                    0:00:01
Conf 5          0:00:00                    0:00:01
Conf 6          0:03:22                    0:00:01
Conf 7          0:00:00                    0:00:01
Conf 8          0:04:03                    0:00:01
Conf 9          0:04:28                    0:00:01
Conf 10         0:00:00                    0:00:01

Table 6: Detailed timing for AIS.


Related Work


A number of tools allow for automated and unattended installations.
Tools that allow metadata-based installations include, among others,
RedHat Kickstart [15], FAI [7, 8], and LUI [13]. They can support a
large number of configurations, but take a relatively long time to
install and couple the system specification and the installation
technique. SystemImager [7, 9] and Frisbee [3] are tools that follow
the imaging approach. As such, they install quickly, but lack
flexibility in configuration and require disk space that increases
with the number of system configuration variations.


A number of more sophisticated host configuration management tools
exist which attempt to solve the larger problem of creating and
maintaining software configurations across a large number of servers
and desktops. These tools are discussed in [2, 1, 14, 5, 16, 12, 11]
and are complementary to the AIS approach, as AIS only attempts to
solve the problem of fast, efficient installation and could easily be
matched with an SCM system for ongoing management. Currently the host
configuration phase of the client installation process is achieved by
copying over the necessary configuration files from the Installation
Server. While this is sufficient for a prototype implementation, the
overall value of the system would increase if one of the SCM systems
were incorporated as one of the phases of the overall AIS process.


Availability 


The AIS system as well as additional documentation
is available for download under an open-source
license at the web site
http://www.ensl.cs.gwu.edu/projects/ais/ .


Figure 6: Comparison of disk and installation speed for Imaging and AIS
solutions when the system configuration evolves. (Plots not
reproduced. Axes: Disk Space Required (in GB) and Time to Install (in
minutes) versus System Lifetime. Not drawn to scale.)

Note on disk space: ... new configurations, we need to create the file
system so that a snapshot can be taken, or updating a golden server if
available. AIS does not have this overhead, since it can use the best
available image from previous configurations to arrive at the new
system configuration.

Scenario 1: Time does not increase because the same package is
updated. To arrive at state 3 we do not go through state 2, but go
directly from 1 to 3. Expected time for AIS is marginally slower
because of additional packages that need to be installed after
imaging.

Scenario 2: Image rebuilt to account for changes to the original
configuration. The rebuilt image represents a more recent
configuration. Fewer post-imaging actions are necessary, so overall
time is less as compared to Alt 1. To achieve a configuration that is
older than the image, AIS will perform steps in reverse.


Future Work


AIS is a prototype implementation of a hybrid approach to installing
software on a large number of client machines. The current
implementation is limited to RPM. However, conceptually the ideas can
carry over to other package management systems, as well as to using
multiple package management systems on the same machine, if necessary.
What is important is the ability to determine the makeup of the
system’s configuration in terms of installed packages, be it from one
or more package management systems’ databases.


The ability to query and list the content of the system is a
significant advantage provided by package management systems; it is
not available by default in systems where many software components
are installed from source, “tar.gz” files, or similar mechanisms.
Furthermore, source installations do not provide dependency
information and thus are not naturally suited for AIS-type systems,
which construct system images from metadata descriptors.


Conclusion 


AIS combines the features of custom install tools and imaging tools
and provides a beneficial balance of ease of maintenance, scalability
to a large number of system configurations, and speed of client
re-installation. This paper demonstrates how a hybrid, caching
installation system can achieve both fast installation and low disk
space use. These ideas can apply to imaging and installation tools
other than Frisbee and RedHat Kickstart.


ABSTRACT


It is often difficult and time-consuming to manage computer ‘moves, adds, and changes’ that 
take place in a switched, subnetted environment. It is even more difficult when the accepted 
network policy requires that a computer be configured and always connected to the network on a 
specific Virtual LAN (VLAN) based on its usage. An on-going problem is to keep all of the 
network host information up-to-date, and to ensure that hosts always land on the correct subnet 
when they are plugged into switch ports. Utilizing some freely available tools, as well as some
home-grown software, we have built a system that automates a number of the tasks associated
with moves, adds, and changes.


Where We Were 


The Department of Computer Science at Princeton 
University has a routed IP network of 1500+ hosts, with a 
VLAN assigned to each subnet. IP addresses are assigned 
either from Princeton’s address space or from any of a 
number of private IP address blocks, as defined in RFC
1918.¹ Hosts with private IP addresses are subject to
Network Address Translation (NAT) by a CheckPoint firewall
before any of their network traffic is sent out to the rest of 
the campus or the Internet. All intra-departmental routing 
is handled by a Foundry FastIron 1500 ethernet switch. 
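
Whether a host falls under the NAT rule depends only on whether its address lies in one of the RFC 1918 blocks. A minimal Python sketch of that check, using the standard `ipaddress` module, is shown below; the function name and the sample addresses are hypothetical, and this is not part of the system described in this paper.

```python
import ipaddress

# The three private address blocks defined in RFC 1918.
RFC1918_BLOCKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]


def needs_nat(address):
    """True if the address is in RFC 1918 private space."""
    ip = ipaddress.ip_address(address)
    return any(ip in block for block in RFC1918_BLOCKS)


for addr in ("10.1.2.3", "172.31.255.1", "172.32.0.1", "128.112.1.1"):
    print(addr, needs_nat(addr))
```

The explicit block list is used instead of the module's `is_private` attribute because that attribute also covers other reserved ranges, such as loopback and link-local addresses.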


The department has a role-based network model, 
meaning that a host’s place on the network is deter- 
mined by its function and who uses it. Each host has a 
statically-assigned IP address tied to its ethernet MAC 
address. The network model specifies how hosts
should be mapped to IP subnets, and how subnets
correspond to VLANs. A faculty member’s office
machine belongs on the faculty-office VLAN. A host
used by a member of the administrative staff belongs 
on the admin-staff VLAN. A graduate student’s office 
machine belongs on the grad-office VLAN, while a 
host used in that graduate student’s research belongs on 
a VLAN specific to the project. Figure 1 shows VLAN 
assignments for some of the roles a host might fit into. 


¹ Address Allocation for Private Internets





Class       Membership                      Subnet              VLAN
Faculty     faculty office hosts            128.112.aa.0/24     aa
Grads       graduate student office hosts   128.112.bb.0/23     bb
PlanetLab   PlanetLab nodes                 128.112.ccc.64/26

Figure 1: Example network model host roles.


Note that the main purpose of our network model is to segregate hosts
by usage. In general, there are no restrictions on traffic between IP
subnets. However, as some of the subnets are in private IP address
space, access to hosts on those subnets is restricted from outside of
the CS department.


As mentioned above, routing within the CS department is handled by a
Foundry FastIron 1500 ethernet switch, which is at the core of our
network. A number of other switches, from various vendors, are
attached to it. They are configured as either “infrastructure”
switches or “user” switches. The infrastructure switches are used for
connections to servers and other network devices, such as printers,
IEEE 802.11b [7] access points and network cameras. Infrastructure
switch ports are in physically secure areas, preventing users from
connecting to them. The user switches are used for network connections
to end-user hosts. Most of the time spent dealing with ports is
devoted to user switch ports.


There are a number of routine tasks involved in managing this network,
including: host registration, maintaining compliance with the network
model, and management of physical network connectivity. Prior to the
implementation of autoMAC, these tasks required a lot of time and
intervention by the Computer Science technical staff.


Host registration was accomplished using our inventory/host database
system that was built around a Berkeley DB file, processed by a number
of Perl scripts. Hosts were entered into the system by a script that
called ‘vi,’ allowing edits to be made on a template host entry. The
same script allowed changes to be made to existing entries. Since the
data entry environment used a simple text editor, it was difficult to
ensure the accuracy and consistency of the entries. Once the required
changes had been made to the DB file, DNS, DHCP, and NIS configuration
files were generated and installed by using the ‘make’ command. Note
that all of the preceding steps were carried out by a member of the
technical staff.


In the process of entering or changing a host 
entry, it was necessary to examine the supplied infor- 
mation and make a decision about which VLAN the 
host should be assigned to. One problem with this 
method is that, given the same information, different 
technical staff members sometimes made different 
decisions, resulting in hosts with the same roles being 
assigned to different VLANs. Consequently, some 
hosts ended up where they didn’t belong. 


Port configuration changes on the switches 
required an administrator to connect to the appropriate 
switch and to change its settings manually. If needed, a 
patch cable would be installed to connect the switch 
port to an office wall box. When a user moved a 
machine from one location to another, it was very
likely that either a change in the switch configuration or
a change on the patch panel (or both) would be required.


The process of registering a host and getting it 
connected to the network, or moving it to a different 
VLAN because its role in the network model changed, 
frequently involved multiple email exchanges between 
the requesting user and the technical staff. Often, the
initial request would be something like “Please register
my new computer on the network with the name
mongo.” As this was not nearly enough information
to process the request, we would send a reply asking 
for the rest of the information we needed. It might take 
several rounds of email replies to get all of the infor- 
mation needed to register the host and to activate a 
wall box jack. 


Including the time spent awaiting additional 
information from the user, it could be hours from 
when the initial request was received before a new 
host could be used on the network. The only semi- 
automated task was the generation and installation of 
the configuration files. 


Another time consuming task was isolating a 
host that was the source of a network problem. It was 
necessary to locate which switch port it was using and 
to reconfigure the port manually. Some local com- 
mand-line tools were available to assist in identifying 
the problem host, but it was still necessary to connect 
to the appropriate switch and reconfigure the port in 
order to move the host to a different subnet for isolation
or testing.


While packages are available to manage host 
registration (such as NetReg [11, 10]) and switch con- 
figuration (such as Splat [2]), no freely available pack- 
age ties everything together and automates all of the 
tasks our staff is required to do in order to manage 
how hosts connect to the network. 


What We Did 


The system we envisioned and then implemented 
allows users to register new hosts using a web inter- 
face, with minimal intervention by the technical staff. 
The request is processed automatically by a daemon, 
and the new host is typically available for use on the 
network within 10-15 minutes. The host can be connected
to any “public” ethernet jack and will automatically
be placed on the proper VLAN. Public ethernet
jacks include those in open spaces as well as depart- 
mental offices and labs. In general, any ethernet jack 
that is accessible to people outside of the technical 
staff is considered public. 


Access to the web form is restricted to members 
of the CS department by requiring that they enter their 
departmental username and password. The form 
presents the user with questions to determine the 
proper class and VLAN of the new host. Infrastructure 
VLANs used by secured service hosts are not avail- 
able on the switch ports the users can access. In addi- 
tion, we are in the process of implementing an 
enhancement that will offer to each user in the depart- 
ment only their allowed user classes. 


The idea of a web interface for host registration 
is not new. What makes autoMAC different is that it 
enables automated control of how a host is connected 
to the network after registration. 


Automatic VLAN Assignment 


A key feature of the autoMAC system is that a 
user can plug their host into any public ethernet jack 
attached to one of our user switches, and it will be 
automatically connected to the correct VLAN. This is 
accomplished using a RADIUS [6] server that 
“authenticates” the MAC address of the host attempt- 
ing to connect to the network. 


A number of web-based and IEEE 802.1X [9] 
authentication/authorization systems exist that require 
information to be manually entered or supplicant soft- 
ware to be installed on a host before it can access the 
network. These systems are well suited for authorizing 
user access to a network, but are not designed to work 
with unattended devices, such as printers or network 
cameras. In addition, requiring our users to install 
IEEE 802.1X supplicant software before they access 
the network is not practical, given a stream of fre- 
quent, temporary visitors. Ideally, for our purposes, it 
should be possible to connect any type of ethernet 
device to any switch port and have it placed on the
correct VLAN without any additional human interac- 
tion. Using a MAC address to identify a host, rather 
than a name and password to identify a user, makes 
this possible. Our users are still required to authenti- 
cate with their username and password to access 
departmental servers. 


The RADIUS server configuration is built from 
data in our host database. For each MAC address, a 
“user” entry is generated, using the MAC address as 
both the username and password. The entries contain 
additional tags that specify to which VLAN a host 
belongs. 
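
Because the MAC address doubles as the RADIUS username and password, every tool that touches it has to agree on one spelling. The following is a minimal sketch, not the code used in autoMAC, of how an address typed in either of the styles the registration form accepts might be reduced to the bare lowercase form seen in the RADIUS entries. The function name `normalize_mac` and the choice to also accept `-` as a separator are assumptions made for illustration.

```python
import re


def normalize_mac(text):
    """Return a MAC address as 12 lowercase hex digits, or raise ValueError."""
    text = text.strip().lower()
    # Already in bare form, e.g. "0800071a3456".
    if re.fullmatch(r"[0-9a-f]{12}", text):
        return text
    # Separated form, e.g. "08:00:7:1a:3d:56"; octets may be one or two digits.
    parts = re.split(r"[:-]", text)
    if len(parts) == 6 and all(re.fullmatch(r"[0-9a-f]{1,2}", p) for p in parts):
        # Pad single-digit octets such as "7" to "07".
        return "".join(p.zfill(2) for p in parts)
    raise ValueError(f"not a MAC address: {text!r}")
```

Normalizing once, at data entry, means a duplicate check against the host database can be a simple string comparison.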


A host whose MAC address is not listed in the 
host database is considered unknown, and is therefore 
not listed in the RADIUS configuration file. When a 
user switch makes an authentication request with an 
unknown address, the RADIUS server responds in the 
negative. The switch then sets the port to which the 
host is connected to a registration VLAN, where the 
NetReg server presents our web-based host registration 
interface to the user when a web browser is started. 


If a user moves a host from one wall box jack to 
another anywhere in the Computer Science building, 
the host stays on the correct VLAN since the switch 
port behind it is automatically configured using infor- 
mation provided by the RADIUS server. To move a 
host to a different VLAN, we simply need to change 
its IP address in the host database to an address on the 
other VLAN and rebuild the configuration files. The 
next time the host is connected to any switch port in 
the network, it is automatically placed on the new 
VLAN. If we need to force the issue, we can use tools 
we have developed to identify the switch port cur- 
rently being used by the host, and turn the port off and 
back on. This forces the switch to re-authenticate the 
host the next time the host sends out any network traf- 
fic, causing it to be moved to the new VLAN. 


Implementation Details 


This system was not built from scratch. Rather, it 
was constructed by modifying our existing tools, and 
combining them with a number of open source pack- 
ages available on the Internet. Specifically, we used 
FreeRADIUS [5] and NetReg, in addition to our exist- 
ing host database and new code written to work with 
our web server and the above packages. 


As mentioned earlier, our host database system is 
built around a Berkeley DB file. The vi-based data 
entry interface allows far too much latitude in what 
may be entered for a given field, making field valida- 
tion difficult. However, the system is implemented 
with perl scripts that use a common library of func- 
tions to manipulate the data in the DB file. Having this 
library makes it fairly easy to implement new func- 
tionality such as bulk or daemon-based data entry. 


The development of autoMAC began in July 
2003, when we decided to implement a FreeRADIUS 
server to control access to our IEEE 802.11b wireless
infrastructure. The Cisco access points (APs) we were 
using could be configured to query a RADIUS server 
using a MAC address for the username and password. 
The AP would then either allow or deny association of 
the client machine based on the response of the 
RADIUS server. While we realize that this is not 
strong security, it did, along with turning off broadcast 
of the SSID, prevent casual association with our wire- 
less infrastructure. 


While we were working on the FreeRADIUS 
server, we checked to see what other pieces of our net- 
work infrastructure might be able to make use of a 
RADIUS server. We discovered that the Foundry 
switches could utilize a RADIUS server to authenti- 
cate users, and configure their ports to specific 
VLANs. If this functionality could be extended to use 
MAC addresses for usernames and passwords, it 
would give us a very useful tool. 


We spoke to a number of ethernet switch vendors 
to express our interest in MAC-based “authentication”
for their switches, and received positive
responses from several of them. As our user switches 
are from Foundry, we encouraged them to add this 
feature to their FastIron line, which they did. We have 
subsequently learned that several other switch vendors 
also now offer this feature, or will be offering it soon. 


One of our goals for this project was to simplify 
the task of host registration for both our user base and 
the systems staff. To this end, we investigated two 
web-based host registration systems: Southwestern 
University’s NetReg [11] and Carnegie Mellon Uni- 
versity’s NetReg/NetMon [10]. Both of these systems 
had features that we liked, but they also did much 
more than we needed. In the end, we built our own 
NetReg, utilizing some of the code from Southwest- 
ern’s system which they graciously released under the 
GNU GPL [3]. 


Our NetReg implementation consists of a Linux 
system running the Internet Systems Consortium’s 
BIND (DNS) and DHCP software, configured accord- 
ing to the instructions provided with Southwestern’s 
NetReg system. The Linux system is also running the 
Apache Web Server, configured to use SSL and serve 
our custom registration page. The machine is con- 
nected to a dedicated registration VLAN as well as 
one of our production infrastructure VLANs. The in- 
frastructure connection allows communication with 
our existing host database system. The machine uses 
an ‘ipchains’ firewall to limit vulnerability to attack 
from the registration VLAN and has routing disabled 
to prevent traffic from leaving the VLAN. 


Tying Things Together 


Once all of the parts of the puzzle had been identi- 
fied, we needed to piece them together into a usable, 
cohesive system. We had a host database that associated 
MAC addresses with IP addresses. We wanted a web-based
system to add entries to the database. We had
switches that could set ports to VLANs based on data 
from a RADIUS server, and we had a RADIUS server 
that could send the data. Now we needed to write 
some code to tie the components together. 


Entering data in a text editor, based on email 
received from users, was an error-prone process. It 
was also a job we were looking to get out of. A web- 
based interface with field validation seemed to be a 
very good idea for a replacement. 


The front-end we implemented allows users and 
support staff to easily enter all of the information 
required to register a new host and submit the request. 
Users of this interface are required to authenticate 
themselves with the same username and password 
used for connecting to our Solaris and Linux systems, 
as well as our e-mail server. We use SSL encryption to 
prevent user credentials from traveling over the net- 
work in the clear. All of the web pages are generated 
using PHP [4] scripts. 


Figure 2 shows the host registration form. Extensive
field validation is done to prevent duplicate
entries and to ensure compliance with our network 
model. When the user submits the form, the request is 
sent to a new-host daemon that does the actual host 
registration and configuration file generation. The 
request is sent as a specially-formatted email message 
on behalf of the user filling in the form. The daemon 
parses the message body and invokes a perl script to 
update the host database DB file. It then runs a ‘make’ 
to generate and install all of our configuration files. 
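
The message format used between the web form and the new-host daemon is not given here, so the following is only a sketch of the parsing step under an assumed layout: a plain-text body of `key: value` lines. The field names in `REQUIRED_FIELDS` and the function name `parse_request` are hypothetical. Python is used purely for illustration; the daemon described above drives perl scripts.

```python
from email import message_from_string

# Hypothetical field names; the real message layout is not described.
REQUIRED_FIELDS = ("hostname", "mac", "user_class")


def parse_request(raw_message):
    """Parse a registration e-mail into a dict of lowercase field names."""
    msg = message_from_string(raw_message)
    fields = {}
    for line in msg.get_payload().splitlines():
        # Split on the first colon only, so MAC addresses survive intact.
        key, sep, value = line.partition(":")
        if sep and value.strip():
            fields[key.strip().lower()] = value.strip()
    missing = [name for name in REQUIRED_FIELDS if name not in fields]
    if missing:
        raise ValueError("missing fields: " + ", ".join(missing))
    # The request is sent on behalf of the user filling in the form.
    fields["requested_by"] = msg.get("From", "")
    return fields
```

After a request parses cleanly, the remaining steps are the ones already described: update the host database and run ‘make’.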


RFC 2868² defines RADIUS attributes to be used
for “compulsory tunneling in dial-up networks.” These
attributes are leveraged by IEEE 802.1X (in Annex D)
and RFC 3580,³ which specify additional tunnel
attribute information that can be used to assign VLANs
to hosts being authenticated by a RADIUS server. In a
VLAN environment, these attributes are returned as
part of the ‘Access-Accept’ message from the RADIUS
server when access is granted to a host, and can be used
by the switch to configure a port’s VLAN.


²RADIUS Attributes for Tunnel Protocol Support

³IEEE 802.1X Remote Authentication Dial In User Service
(RADIUS) Usage Guidelines


[Screenshot: the “Computer Science - New host request form”
web page. Fields in red are required. The form asks for:
CS Username; First Name; Last Name; Technical Contact Email
Address; User Class; Hostname; Host Class Type;
Ethernet/MAC-Address 1 through 3, each with a media type
such as Wired Ethernet (e.g. 08:00:7:1A:3D:56 or
0800071A3456); Model Number; Serial-Number;
PU-TAG/Asset-Number; Building/Location; Room Number/Address;
Connection Type (Ethernet, Wireless, Fiber, DSL, No-Media);
Operating System; Host Owner; Host Expire Date (YYYY-MM-DD);
and Comments.]

Figure 2: Host request form with default information filled in.


We generate a ‘users’ file for the RADIUS server 
that contains entries for every host registered in our 
database. There is a VLAN-based look-aside table 
used in this process to determine whether or not tunnel 
attributes should be added to the entry. Figure 3 shows 
an example RADIUS user entry with tunnel attributes 
while Figure 4 shows one without. 
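
As a rough illustration of this generation step, the sketch below emits entries in the shape of Figures 3 and 4 from a list of (MAC address, VLAN) pairs. It is not the generator used in autoMAC: the names `TUNNEL_VLANS`, `radius_entry`, and `build_users_file` are hypothetical, the look-aside table is reduced to a set of VLAN numbers, and the attribute spellings simply mirror the figures.

```python
# Hypothetical look-aside table: VLANs whose hosts get tunnel attributes.
TUNNEL_VLANS = {702}


def radius_entry(mac, vlan):
    """Build one 'users' file entry; the MAC is both username and password."""
    lines = [f'{mac}\tAuth-Type := Local, User-Password == "{mac}"']
    if vlan in TUNNEL_VLANS:
        lines += [
            "\tTunnel-Type = VLAN,",
            "\tTunnel-Medium-type = 802,",
            f"\tTunnel-Private-group-ID = {vlan},",
        ]
    lines.append("\tFall-Through = Yes")
    return "\n".join(lines)


def build_users_file(hosts):
    """hosts is an iterable of (mac, vlan) pairs taken from the host database."""
    return "\n\n".join(radius_entry(mac, vlan) for mac, vlan in hosts) + "\n"
```

A host on a VLAN that is absent from the table, such as the wireless VLAN, comes out with only the first and last lines.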


One of the VLANs that is not in the look-aside 
table is the one we use for 802.11b wireless access. 
Therefore, a host registered for wireless access would 
be listed as in Figure 4, without tunnel attributes. This 
is the type of entry we started with when we first 
implemented the RADIUS server for use with our 
802.11b Access Points. 


When a host is connected to one of our user 
switch ports, the switch sends an Access-Request mes- 
sage to the RADIUS server with the host’s MAC 
address as the username and password. The RADIUS 
server returns either an Access-Accept or an Access- 
Reject message to the switch, based on whether or not 
the MAC address is known. If an Access-Reject is 
returned, the switch sets the port to the registration 
VLAN so that the user may register the machine using 
NetReg. The switch will also set the port to the registra- 
tion VLAN if an Access-Accept is received without any 
VLAN information included. On the other hand, if the 
RADIUS server recognizes the address and has VLAN 
information for that address, it returns an Access- 
Accept message with the VLAN information which the 
switch then uses to set the port to the correct VLAN. 
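
The three outcomes described above reduce to a small decision rule. The sketch below models it in Python as an illustration only; the real logic lives in the switch firmware and the RADIUS server. `REGISTRATION_VLAN` is a made-up VLAN number, and `radius_reply` and `port_vlan` are hypothetical names.

```python
REGISTRATION_VLAN = 999  # hypothetical VLAN number, not taken from the paper


def radius_reply(mac, known_hosts):
    """Mimic the RADIUS server; known_hosts maps MAC -> VLAN number or None."""
    if mac not in known_hosts:
        return ("Access-Reject", None)
    return ("Access-Accept", known_hosts[mac])


def port_vlan(reply):
    """Mimic the switch: pick the VLAN for a port from the RADIUS reply."""
    code, vlan = reply
    if code == "Access-Accept" and vlan is not None:
        return vlan
    # Access-Reject, or Access-Accept without VLAN information.
    return REGISTRATION_VLAN
```

Only an accepted address that carries VLAN information leaves the registration VLAN; both other cases fall through to it.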


When a host is connected to the registration 
VLAN, its DHCP request is answered by the NetReg 
server. The reply specifies the NetReg server itself as 
the router for the host to use. Therefore, any subse- 
quent IP traffic from the host will be sent to the NetReg 
server. The only other devices with IP addresses on this 
VLAN are other hosts awaiting registration, so it is 
unlikely that any network traffic will leak out from this 
VLAN to the rest of the campus or the Internet. 


The DHCP reply also specifies the NetReg server 
as the DNS server to use. The DNS daemon code is 
configured to return the NetReg server’s IP address for 
any lookup received. This means that hosts on the reg- 
istration VLAN will be directed to the NetReg server 
for any web sites they may try to contact. 


00d0b7b600ee  Auth-Type := Local, User-Password == "00d0b7b600ee"
        Tunnel-Type = VLAN,
        Tunnel-Medium-type = 802,
        Tunnel-Private-group-ID = 702,
        Fall-Through = Yes

Figure 3: RADIUS user entry with tunnel attributes.


00c04f7f1006  Auth-Type := Local, User-Password == "00c04f7f1006"
        Fall-Through = Yes

Figure 4: RADIUS user entry without tunnel attributes.


The Apache HTTP daemon is configured to
present a login page for any unknown URL. This login
page tells the user that they are using an unregistered
host, and prompts them for their Computer Science
username and password so that they may register the
machine. This limits requests to members of the CS
department. If someone is visiting the department for a
few days and wants to use their laptop on the network,
it needs to be registered by the member of the department
they are visiting.


Upon successful authentication, the user is redirected
to the host registration form shown in Figure 2.
Submitting the form sends the registration request to
the new-host daemon and displays a page to the user,
stating that the registration request is being processed
and instructing the user to disconnect from the network
port and reconnect in ten minutes. When the host
is disconnected or rebooted, the RADIUS authentication
process starts all over again. This effectively
removes a host from the registration VLAN once it
has been successfully registered.


The email address to which the registration information
is mailed uses procmail [12] to validate the message
and save it in a spool directory. The new-host daemon,
mentioned earlier, processes messages from the
spool directory and adds the new hosts to the host database.
The daemon then runs the ‘make’ command to
generate and install all of the required configuration files.


An Example Scenario


Here then is an example scenario, showing how
all of the parts operate together. The assumptions are
that a Computer Science user wants to register a new
host on the network; that the host will try to get an IP
address with DHCP; and that the host will try more
than once to obtain an IP address. Switch ports are
configured to be on an unused, isolated VLAN until
the switch receives information from the RADIUS
server. Note that the level of detail in the following list
items will vary, to prevent us from getting bogged
down in low-level details.

• The user connects the new host to a public ethernet
jack, which is connected to a port on one
of our “user” ethernet switches, and turns on
power to the machine. The machine attempts to
obtain an IP address using DHCP.

• The ethernet switch blocks the DHCP request,
extracts the host’s MAC address from the request
and uses the MAC address in an Access-Request
message to the RADIUS server.

• Since this is a new host, the RADIUS server’s
configuration files do not list the MAC address,
so it returns an Access-Reject message to the
switch, which then configures the port to be on
the registration VLAN.

• Because the switch blocked the host’s first
DHCP request, the DHCP client process on the
host times out and sends another request.

• The NetReg server on the registration VLAN
receives the DHCP request from the host and
responds with a short-lived IP address on the
registration subnet, as well as default router and
DNS server addresses pointing to the NetReg
server. The host uses this information to configure
its network interface.

• The user starts a web browser on the host and
attempts to connect to some web site. This
causes the host to send a DNS query for the
host named in the URL. We assume that the
URL will be for ‘HTTP’ or ‘HTTPS,’ and that
the host portion will contain a domain name,
and not an IP address. The DNS daemon on the
NetReg server resolves the domain name portion
of the URL to its own IP address, and
sends the address back to the host. If an IP
address is used, the HTTP request will time out,
as the request will not be forwarded off of the
registration VLAN.

• The web browser on the host now makes a connection
to the HTTP daemon on the NetReg
server, requesting whatever page was specified
in the URL. The HTTP daemon responds with
an information page notifying the user that he is
using an unregistered host. The page includes a
link to the host registration web page.

• The user clicks on the link, and is prompted for
their CS username and password by the HTTP
daemon. If they enter valid credentials, the
HTTP daemon sends the host registration form
to the host. Otherwise, an “access denied”
message is sent, and the user needs to try again.

• The user fills out the form, and clicks the
“Verify Now” button. A CGI script on the NetReg
server validates the entry data and, if all is well,
sends a display-only page to the user with a
“Submit Now” button at the bottom. Otherwise,
it displays a partially completed registration
form, with instructions on how to correct
any errors.

• The user clicks on the “Submit Now” button,
sending the form information to another CGI
script which formats and sends an e-mail message,
on behalf of the user, to the new-host daemon.
The script also displays a page to the user,
stating that the registration request has been
sent and that the host should be rebooted in ten
minutes.

• The new-host daemon parses the e-mail message,
validates the information, and if all is well,
selects an IP address and adds the host to the host
database. The daemon then runs ‘make’ to generate
and install the configuration files.

• After ten minutes, the user reboots his host.

• When the host reboots, it drops ethernet link,
causing the switch to remove the port from the
registration VLAN. The host then attempts to
obtain an IP address using DHCP.

• The ethernet switch blocks the DHCP request,
extracts the host’s MAC address from the
request and uses the MAC address in an
Access-Request message to the RADIUS server.

• The RADIUS server now has the MAC
addresses listed in its configuration files, along
with VLAN information, so it sends an
Access-Accept message, including the VLAN
information, to the switch, which then configures the
port to that VLAN.

• Because the switch blocked the host’s first
DHCP request, the DHCP client process on the
host times out and sends another request.

• This time, the DHCP request is received by our
production DHCP server, which responds with
a reply containing the host’s registered IP
address, along with our normal DNS server and
default router information.

• The user may now use his host normally on the
CS network. All future reboots and DHCP
requests from this host will result in the switch
port to which it is connected being configured
to the proper VLAN, as the MAC address is
now known.


Security Considerations 


While autoMAC limits user host connections to 
the appropriate VLAN (according to our role-based net- 
work model), we do not rely on the system to provide 
strong security. Intentional spoofing of MAC addresses 
is not addressed, as this possibility always exists. 


Anti-spoofing ACLs on our core switch prevent 
hosts from pretending to originate from different sub- 
nets. As we mentioned earlier, users still have to authen- 
ticate with login name and password to access our 
departmental servers. Multiple attempts to register dif- 
ferent MAC addresses from the same port within a short 
period of time could be logged and trigger a report. 
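
The kind of check suggested in the last sentence could be as simple as a sliding window per switch port. The sketch below is one possible shape for it, not part of autoMAC: the class name `RegistrationMonitor`, the ten-minute window, and the limit of three distinct addresses are all assumptions chosen for illustration.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 600   # hypothetical window length
MAX_DISTINCT_MACS = 3  # hypothetical limit before a report is triggered


class RegistrationMonitor:
    """Track registration attempts per switch port within a time window."""

    def __init__(self):
        self._attempts = defaultdict(deque)  # port -> deque of (time, mac)

    def record(self, port, mac, now):
        """Log one attempt; return True if the port should be reported."""
        attempts = self._attempts[port]
        attempts.append((now, mac))
        # Drop attempts that have aged out of the window.
        while attempts and now - attempts[0][0] > WINDOW_SECONDS:
            attempts.popleft()
        distinct = {m for _, m in attempts}
        return len(distinct) > MAX_DISTINCT_MACS
```

Repeated attempts with the same address do not count against the port, so a user who simply resubmits the form is not flagged.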


If a host gets a new ethernet address due to a 
hardware change, the user will need to register the 
new ethernet address before he can use anything other 
than the registration VLAN. Ethernet card swaps 
between machines are most likely to occur within a 
user class (e.g., graduate student to graduate student,
not graduate student to faculty), so the worst problem 
is one of host identification, not VLAN access. In the 
case of a graduate student swapping the ethernet cards 
of a research host and an office host, the machines will 
end up on the wrong VLANs, and the student will no
longer be able to use either machine effectively. 


In the event that a host needs to be quarantined 
due to virus infection or other problems, it can easily 
be moved to an isolation VLAN. The only way for a 
user to circumvent that isolation would be to change 
the host’s MAC address. If the new MAC address was 
unregistered, the host would be connected to the regis- 
tration VLAN. If the MAC address of another regis- 
tered host was used, the infected machine would be 
connected to the VLAN associated with the stolen 
MAC address. If the MAC address was already in use, 
the result would be two machines getting the same IP 
address from the DHCP server, and neither machine 
would work reliably. This would most likely trigger a 
complaint to the technical staff from the owner of the 
host whose MAC address had been stolen. 


Future Enhancements 


With our new system it is relatively easy to 
reconfigure hosts to be on different VLANs. As a 
result, we think it should not be too difficult to inte- 
grate this with other systems that scan automatically 
for active viruses and current patch levels before a 
registered host is allowed network access. In addition, 
infected or vulnerable hosts on the network can be 
quarantined, and transferred to isolated VLANs where 
the appropriate host patching tools are available. 


According to the documentation, our Cisco 
802.11b APs can also make use of the RADIUS tunnel 
tags to specify a VLAN for a wireless client. Adding 
tunnel tag support to the APs will allow us to imple- 
ment a registration VLAN for wireless users as well as 
wired users. It will also allow us to segregate wireless 
traffic by host role, as we do for wired traffic. Cur- 
rently, all wireless users are on the same VLAN. 


Availability 


Some of the tools we have developed are avail- 
able to the public, so that others may implement simi- 
lar systems. The host database system we use will not 
be made available, as we are in the process of replac- 
ing it. The software and some configuration examples 
can be found at http://www.CS.Princeton.EDU/autoMAC/ .


RFC documents can be found on the Internet 
Engineering Task Force web site http://www.ietf.org/ . 


Conclusions 


In a network environment such as ours, where 
hosts from different VLANs can be plugged into any 
public ethernet jack, it can be difficult and time-con- 
suming to ensure that the switch ports behind those 
wall jacks are always on the correct VLAN for a given 
host. Using a number of different tools and some 
switch firmware features, we have put together a sys- 
tem that uses the host MAC address to automatically 
configure switch ports properly. 
Combining automatic port configuration with
web-based automated host registration allows users to 
start using new hosts on the network with little or no 
intervention by the technical staff. This means that if a 
user brings a new host into the department after nor- 
mal business hours, he won’t have to wait until the 
next day to start using it. When paper deadlines are 
looming, and our students need to add a few machines 
to the network in order to complete their experiments 
(at 3 a.m.), near-instant host registration is a big win. 


By implementing this system, we have moved 
from a mostly manual to a mostly automated method 
of handling network moves, adds, and changes. This 
has improved response time to our users and also pro- 
vided support staff with time to work on other 
projects. Additionally, we have improved compliance 
with our network model, by automating the
switch-port-to-VLAN mapping. This project would not have
been possible without the availability of a number of 
Open Source Software projects. 
ABSTRACT


Analysis of network traffic is becoming increasingly important, not just for determining 
network characteristics and anticipating requirements, but also for security analysis. Several tool
sets have been developed to perform analysis of flow-level network traffic; however, none have
had security as the primary goal of the analysis, nor has performance been a key consideration.


In this paper we present a suite of tools for network traffic collection and analysis based on 
Cisco NetFlow. The two primary design considerations were performance and the ability to build 
richer models of traffic for security analysis. Thus the data structures and code have been 
optimized for use on very large networks with a large number of flows. Data filter rates are 
approximately 80 million records in less than 1.5 minutes on a Sun 4800. 


Introduction 


Cisco NetFlow [11] is becoming an increasingly 
popular method for analyzing network traffic, and sev- 
eral tools (e.g., [3, 8, 1]) have been developed to take 
advantage of this flow information. However, most of 
these tools have been developed within the context of 
academic settings, where performance was not critical. 
The SiLK Suite¹ was developed to provide analysis
tools for very large installations, such as large corpo- 
rations, government organizations, and backbone ser- 
vice providers. These sites often transfer large vol- 
umes of data, much of it extraneous (e.g., worm traf- 
fic, scanning activity). 


In addition to having performance as a key ele- 
ment of the SiLK Suite, the tool set was developed 
with security analysis as a primary goal. This has 
facilitated the development of a new suite of tools that 
allow information filtering in a manner unavailable in 
other tool sets. This suite of tools has been field tested 
at a large ISP, and is now in operational use at this site. 
For example, the tool set is able to process approxi- 
mately 80 million records in less than a minute and a 
half on a Sun 4800. 


This paper provides an introduction to the SiLK 
Suite tool set, describing both the collection system 
and the analysis tools. It provides examples of how to 
use the analysis tools and the types of analyses that can 
be performed. SiLK is then compared to related tools. 


The SiLK Suite can be downloaded from
http://silktools.sourceforge.net/ [5].


Overview of the SiLK Suite 


The SiLK Suite consists of two primary compo- 
nents: the collection system and the analysis tool set. 


¹SiLK stands for System for Internet Level Knowledge,
with the SLK capitalized in memory of Suresh L. Konda,
who was the founder of the project.
The collection system converts Cisco NetFlow Version
5 PDUs into a compressed binary format. The
tools work on these compressed records, allowing a 
user to filter data in a variety of ways and to use a 
series of command line tools for data summarization. 


SiLK was originally designed to address a prob- 
lem inherent in traffic analysis: traditional payload- 
based analysis can make accurate judgments with a 
relatively small amount of data, but traffic analysis 
requires larger volumes of data to assess trends and 
large scale behaviors. Coupled with the volume of 
data seen on our client network, the amount of traffic 
summaries received was on the order of tens of gigabytes
a day.


In order to manage this volume of traffic, SiLK 
adopted three strategies: the footprint of individual 
records was reduced to the minimum necessary to 
store security-relevant information and nothing more, 
files were split into several common pre-defined cate- 
gories to reduce the amount of time to look for spe- 
cific traffic, and the SiLK Suite was made gzip-trans- 
parent. SiLK reads gzipped or unzipped files transpar- 
ently, which yields a substantial performance bonus 
(in our experience, gunzipping a file in memory is 
cheaper than loading the unzipped file). 
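
One common way to get this kind of transparency is to look at the first two bytes of a file, since gzip streams start with the bytes 0x1f 0x8b. The sketch below shows that idea in Python for illustration only; it is not taken from the SiLK sources, and the function name `open_maybe_gzipped` is hypothetical.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of any gzip stream


def open_maybe_gzipped(path):
    """Open a file for binary reading, decompressing it if it is gzipped."""
    with open(path, "rb") as probe:
        magic = probe.read(2)
    if magic == GZIP_MAGIC:
        return gzip.open(path, "rb")
    return open(path, "rb")
```

Callers read from the returned object in the same way in both cases, so the analysis code never needs to know how a given hourly file was stored.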


Collection System 


The collection system has been designed to mini- 
mize the amount of disk space required to store data, 
while still supplying the data required for security 
analysis of network traffic. The collection system 
takes Cisco NetFlow Version 5 PDUs and converts 
them to a “packed” format. The packed records are
stored in a hierarchy based on the router class (e.g., 
ingoing, outgoing) where this information is specified 
by the type of record (e.g., in, inweb, out and outweb), 
and date and time, with hourly files available at the 
leaves. A separate file is maintained for the flow 
records from each router. Only flows that have been
routed are recorded; flows representing traffic that
was dropped by the router due to an access control list
(ACL) violation are not saved.² A NetFlow record
consists of 48 bytes. We reduce this record size to
achieve disk storage savings via three approaches:
• do not store the fields that are not required
• reduce the number of bits used to store some
information
• remove the storage of fields where the file hierarchy
can indicate the same information


The first approach results in the removal of eight
fields from the NetFlow record. In particular,
information about the network path is not maintained. This
results in the removal of the following fields: input
interface number, output interface number, the source
AS number, the destination AS number, the source
mask, the destination mask, and the next hop IP
address. In addition, the type of service information is not
kept. This results in a savings of 19 bytes per record
(see footnote 3). Additional space savings come from a
reduction in the number of bits used to store various
information.


For example, time information is only stored 
with accuracy to within one second, rather than the 
millisecond precision provided by NetFlow. In addi- 
tion, the packed record only uses 12 bits to store the 
start time of the flow, and 11 bits to store the elapsed 
time of the flow. The header for the packed data file 
contains the start time for all of the records in the file. 
As each file contains only one hour of flow data, the 
start time only needs to be the number of seconds 
since the start of the file. The end time is actually the 
number of seconds elapsed since the start of the flow. 
By default, NetFlow flushes a continuous flow after 
30 minutes, and so the elapsed time requires one fewer 
bit than the start time (based on this default). If a site 
is using a different configuration (in particular, flush- 
ing less frequently than 30 minutes), then the code will 
need to be modified to accommodate this difference. 
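
The time encoding described above can be illustrated with a short Python sketch. It is not the SiLK implementation: the function names and the choice to place the start offset in the high bits are assumptions made for illustration. It only shows that a start offset within a one-hour file fits in 12 bits and that an elapsed time bounded by the 30-minute flush fits in 11 bits.

```python
# Illustrative sketch only; names and bit layout are assumptions,
# not the actual SiLK packed-record format.
START_BITS = 12    # offset in seconds from the start of the hourly file
ELAPSED_BITS = 11  # duration in seconds; relies on the 30-minute flush


def pack_times(file_start, flow_start, flow_end):
    """Pack a flow's start offset and elapsed time into one 23-bit integer."""
    offset = flow_start - file_start
    elapsed = flow_end - flow_start
    if not 0 <= offset < (1 << START_BITS):
        raise ValueError("start offset does not fit in 12 bits")
    if not 0 <= elapsed < (1 << ELAPSED_BITS):
        raise ValueError("elapsed time does not fit in 11 bits")
    return (offset << ELAPSED_BITS) | elapsed


def unpack_times(file_start, packed):
    """Recover absolute start and end times (in seconds) from the packed value."""
    offset = packed >> ELAPSED_BITS
    elapsed = packed & ((1 << ELAPSED_BITS) - 1)
    flow_start = file_start + offset
    return flow_start, flow_start + elapsed
```

A flow longer than 2,047 seconds raises an error in this sketch, which mirrors the remark that a site flushing less frequently than every 30 minutes would need a wider elapsed-time field.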


Other bit savings come from storing the average
bytes per packet, rather than the absolute number of
bytes. There are 14 bits dedicated to the number of
bytes per packet, and an additional six bits to represent
the fractional portion of the value. Additionally, only
20 bits are used to represent the number of packets in
a flow. If this value overflows, then an overflow bit is
set. Therefore we only have accuracy in this field up
to approximately one million packets. After this, we
use a multiplier to achieve greater values, but at the
cost of accuracy. It is important to note that implicit in
this design is the concept that flows with small
payloads are more important or interesting than larger
ones, and hence we are not concerned with the exact
values for larger flows. However, the overflow bit has
been designed to still provide information (by acting
as a multiplier), rather than being used as an error flag.


Footnote 2: We currently do process flows that encounter ACL
violations on one of our client sites, where these records are
saved in a different directory in the file hierarchy; however,
this code is not yet available in the open source release.

Footnote 3: Future versions might incorporate some of these
values; however, the open source version currently does not
save this information.
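
The bytes-per-packet field can be read as a fixed-point number with 14 integer bits and six fractional bits. The sketch below is a hypothetical illustration of that idea, not the SiLK code: the function names and the truncating rounding are assumptions, and the overflow multiplier for the packet count is not modeled because its value is not given here.

```python
# Illustrative sketch only; the rounding behaviour is an assumption.
INTEGER_BITS = 14   # whole bytes per packet
FRACTION_BITS = 6   # fractional part of bytes per packet


def encode_bpp(total_bytes, packets):
    """Encode average bytes per packet as a 20-bit fixed-point value."""
    if packets <= 0:
        raise ValueError("a flow has at least one packet")
    fixed = (total_bytes << FRACTION_BITS) // packets
    if fixed >= 1 << (INTEGER_BITS + FRACTION_BITS):
        raise ValueError("bytes per packet does not fit in 14.6 bits")
    return fixed


def decode_bytes(fixed_bpp, packets):
    """Approximate the flow's byte count from the stored average."""
    return (fixed_bpp * packets) >> FRACTION_BITS
```

Because the average is truncated to six fractional bits, the recovered byte count can differ slightly from the original, which is consistent with the design's stated tolerance for inexact values.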


As noted above, some information has been 
removed from the data record and is maintained by the 
structure of the underlying file system. For example, 
the directory hierarchy specifies the type of record as 
the second level in the hierarchy, where the type of 
record can be either web or non-web. For many large 
networks, the majority of traffic consists of web-based 
traffic. By splitting out web traffic into a separate loca- 
tion, the number of bits required to store port informa- 
tion for web records can be reduced. That is, the 
ephemeral port information is maintained (16 bits), 
while only two bits are used to represent the web port 
in use (where the web port can only be one of 80, 443 
or 8080). A third bit is used to indicate if the web port 
is the source port or destination port. In addition, since 
all web traffic uses the TCP protocol, there is no need 
to store the protocol information. In total, these
changes result in a savings of 21 bits, which is a
savings of 2 bytes per record in disk storage. For a site
that sees ten million web flows per hour (incoming or
outgoing), there is a savings of nearly 500 MB per day.
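
The figures in this paragraph can be checked with a few lines of arithmetic. The breakdown of the 21 bits below is one reading that is consistent with the text (a 16-bit port and an 8-bit protocol field replaced by three bits); it is an interpretation, not a statement of the on-disk layout.

```python
# Worked check of the figures quoted above.
port_bits_saved = 16 - 3     # 16-bit web port replaced by 2 bits + 1 direction bit
protocol_bits_saved = 8      # protocol byte dropped (web records are always TCP)
bits_saved = port_bits_saved + protocol_bits_saved

flows_per_hour = 10_000_000
bytes_saved_per_record = 2
bytes_saved_per_day = flows_per_hour * bytes_saved_per_record * 24
megabytes_saved_per_day = bytes_saved_per_day / 1_000_000

print(bits_saved, megabytes_saved_per_day)
```

The daily total of 480 MB matches the "nearly 500 MB" quoted for a site with ten million web flows per hour.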


Using all these techniques results in a flow 
stored as source IP address, destination IP address, 
source port, destination port, protocol, flag combina- 
tion, start time, elapsed time, number of packets, and 
the average bytes per packet (which is converted to number of
bytes by the analysis tools). The packed record only
requires 22 bytes of on-disk storage, while the original 
NetFlow PDU requires 48 bytes. For the web data, this 
is reduced even further to 20 bytes per record. For 
sites that experience large volumes of traffic, this can 
result in significant savings in disk space. 


Analysis Tools 
Data Manipulation 


The SiLK Suite currently provides 13 tools and
seven associated utilities. Libraries are provided for
reading the packed records. These libraries perform file
globbing (through fglob library calls; "fglob" is used
below as shorthand for this file selection) based on
information provided through a standard set of arguments.
These arguments specify the start and end date/time, the
type of data (e.g., incoming or outgoing, web or non-web),
and the sensor (router). These flags can be provided to
the tools that read the packed records, and are used to
specify exactly which files are read. This allows for
enhanced performance by reducing the amount of traffic
that needs to be searched, as only the relevant portions
of data are examined. (In other tools, all traffic is
maintained as flat files, where the analytical tools
require the user to specify the input data file. The
directory hierarchy employed by SiLK, however, is
incorporated into the analysis tools, allowing the
analyst to focus on the behaviour of interest, rather than
needing to find the appropriate file. Additionally, the
hierarchy allows the tools to more quickly locate small
files containing the information for the time and
sensor of interest.) The tools default to using incoming
traffic, both non-web and web, for all sensors.
Additionally, if only a day is provided for the start date,
then the default is to process the entire day. If an hour
is provided, then the default is to process only that
hour. To process some other amount of time (e.g., two
hours), the end date flag is required. If no start date is
provided, the tools default to using the data available
so far for the current day.
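
The default time-range rules above can be sketched in Python. This is a hypothetical helper, not part of SiLK: the function name, the treatment of the end time as inclusive, and the date strings modeled on the examples later in this paper are all assumptions.

```python
from datetime import datetime, timedelta


def hours_to_process(start=None, end=None, now=None):
    """Return the hourly buckets selected by the defaults described above.

    `start` and `end` are strings such as "2004/6/29" or "2004/6/29:17".
    The end hour is treated as inclusive, which is an assumption.
    """
    def parse(text):
        if ":" in text:
            return datetime.strptime(text, "%Y/%m/%d:%H"), True
        return datetime.strptime(text, "%Y/%m/%d"), False

    if start is None:
        # No start date: everything available so far for the current day.
        now = now or datetime.now()
        first = now.replace(hour=0, minute=0, second=0, microsecond=0)
        last = now.replace(minute=0, second=0, microsecond=0)
    else:
        first, has_hour = parse(start)
        if end is not None:
            last, end_has_hour = parse(end)
            if not end_has_hour:
                last += timedelta(hours=23)
        elif has_hour:
            last = first                        # only that hour
        else:
            last = first + timedelta(hours=23)  # the entire day

    hours = []
    current = first
    while current <= last:
        hours.append(current)
        current += timedelta(hours=1)
    return hours
```

Each returned hour would correspond to one leaf file per sensor and record type in the directory hierarchy, which is why narrowing the time range directly reduces the number of files opened.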


The primary tool is rwfilter (see footnote 4). This tool
reads the packed data and can filter based on various
options. These options include filtering based on the
start or end time of the flow, the duration of the flow,
the source or destination ports, the protocol, the number
of bytes or packets, or the flag combination (for TCP).
Perhaps one of the most useful options that can be
provided for filtering is the source and destination
addresses, which can be provided as single addresses,
ranges of addresses, or as a set of unrelated IPs (called
an ipset and described more fully below). Alternatively,
all IPs that are NOT in the set provided can also
be used. In addition to the base command line
functionality, rwfilter has the ability to incorporate a
user-compiled dynamic library. This library can be used to
filter records based upon criteria that are too
complicated to express on the command line, to perform
“canned” queries more quickly than the command
line would allow, and to perform stateful operations on
large sets of flow records. For example, a user can
define their own set of important services (e.g.,
dns/tcp, dns/udp, web, other tcp, etc.), use the
dynamic library to count the number of flows for each
of these services, and print the results at the end of
the rwfilter command.


The rwfilter tool provides two output options:
--pass and --fail. The --pass option allows all data
meeting the specified filtering options to be saved to a
file (or, alternatively, stdout can be specified here, if
the results are to be piped through another command),
while the --fail option saves all those records that did
NOT meet the filtering criteria. Both options can be
used at the same time, allowing a user to chain rwfilter
commands, where data that meets a condition can be
saved in a file via --pass, while those records that fail
the condition can be piped into another rwfilter
command via --fail=stdout. The data files that are generated
by rwfilter are in a binary format similar to that
generated by the packing system, and can also be
read by the rw commands, which allows commands to
easily be chained together. (Records output from
rwfilter no longer have the contextual information provided
by the file hierarchy and therefore fully expand fields
such as start time. The result is a homogeneous stream
of 32-byte records.)


Footnote 4: We use rw as a short-hand for raw. This is a
historical convention, and refers to the type of packed files
we are using and the type of data we are receiving.


A major functionality provided by this tool set is 
a binary representation of IP addresses, called an ipset. 
The ipset data structure is effectively a dynamically 
expanding checklist: the core of the structure is a list 
pointing to 65,536 8 KB bitmaps, where each bit indi- 
cates the presence of an IP address. Under normal cir- 
cumstances, only a small number of the bitmaps are 
allocated, and most ipsets end up being less than a 
megabyte in size. However, the structure is very fast 
(any address is looked up in two memory loads) and 
consequently allows a user to query arbitrary sets of IP 
addresses as fast as any other query in SiLK. 
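
The two-level layout described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the SiLK data structure itself: the class name, the use of `bytearray`, and the bit ordering within a byte are choices made for the example.

```python
import ipaddress


class IPSet:
    """Two-level bitmap: 65,536 slots, each an optional 8 KB bitmap."""

    BITMAP_BYTES = 8192  # 65,536 bits, one per address in a /16

    def __init__(self):
        self.blocks = [None] * 65536  # indexed by the upper 16 bits

    def add(self, address):
        value = int(ipaddress.IPv4Address(address))
        high, low = value >> 16, value & 0xFFFF
        if self.blocks[high] is None:
            # Bitmaps are allocated only for /16 blocks that are used.
            self.blocks[high] = bytearray(self.BITMAP_BYTES)
        self.blocks[high][low >> 3] |= 1 << (low & 7)

    def __contains__(self, address):
        value = int(ipaddress.IPv4Address(address))
        high, low = value >> 16, value & 0xFFFF
        block = self.blocks[high]   # first load: the top-level list
        if block is None:
            return False
        # Second load: the byte of the bitmap holding this address's bit.
        return bool(block[low >> 3] & (1 << (low & 7)))
```

A set holding addresses from only a handful of /16 networks allocates only that many 8 KB bitmaps, which is why most sets stay well under a megabyte.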


An ipset can be built from an ASCII list of dot- 
ted-quad IP addresses using the buildset command. It is 
also possible to use the results from an rwfilter com- 
mand to generate an ipset by piping the output from 
rwfilter to rwset. The rwset command reads in data in the 
packed format and generates an ipset of either the 
source IP addresses or destination IP addresses, as 
specified by a command line option. 


The ipset files that result from buildset or rwset 
can be read using the readset command, which can 
print the IP addresses in the set, a count of the IP 
addresses, or various statistical information. Ipsets can 
also be combined using standard set functions such as 
intersect (setintersect) or union (rwset-union). 


Ipsets allow a user to filter data on IP addresses
that need not have anything in common (e.g., they do not
need to be in a range). For example, a site can
maintain a “bad list” of IP addresses (addresses that are
known to be malicious) as an ipset. This set can then 
be used to filter incoming traffic for any activity from 
these IPs, or to filter outgoing traffic to see if there is 
any communication to these IPs. If a second bad list 
needed to be added to the first, then the two could be 
merged using rwset-union. Similarly, to see what 
addresses the two lists had in common, setintersect 
could be used. 


In order to view the records returned by rwfilter or 
similar utilities (e.g., rwsort, which is described 
below), the command rwcut must be used. This com- 
mand reads in any packed data file, and prints the 
fields in the packed record, along with the sensor ID 
and the end time of the flow. The fields to be printed 
can be specified with the --field option, where the fields 
are numbered from one to 12. (Numbers were used to 
save the user from needing to type each required field 
in full, e.g., --field=sip,dip,sport,dport,stime. Additionally, 
once the user has memorized which numbers map to 
which fields, it allows them to easily specify ranges, 
e.g., --field=1-4,9.) The number of records to print can 
also be specified with the --num-recs option. 


All of the tools work on packed data, as this is
the most efficient. Given the large number of records
that need to be handled quickly, Unix-like utilities
such as rwuniq and rwsort were developed that use
packed data. The rwuniq tool will return the unique
entries for the specified field, along with a count, and
so is equivalent to the Unix command uniq -c. In
addition, rwuniq allows the user to set a threshold, so that
only those entries that occur more often than the
threshold are returned. The rwsort tool sorts packed
data on the specified fields (allowing the user to
specify a primary key and secondary key), outputting the
results in a packed data format. Operations using these
tools on packed data perform more than twice as fast
as the same operations on plain text using traditional
UNIX tools.
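
The counting behaviour attributed to rwuniq can be pictured with a small Python sketch. It operates on plain Python values rather than packed records, and the function name is hypothetical.

```python
from collections import Counter


def unique_with_threshold(values, threshold=0):
    """Count distinct values and keep those seen more often than `threshold`."""
    counts = Counter(values)
    return {value: n for value, n in counts.items() if n > threshold}
```

For example, applying it to a list of destination ports with a threshold of one drops every port that appears only once.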


Currently, rwsort is limited to 50 million records. 
If the input contains more than 50 million records, sort 
proceeds based on just the first 50 million records. 
This limit was provided based on memory restrictions, 
and can be changed easily by modifying the source 
code. This design decision was made based on the 
assumption of 2 GB of RAM being available, and with 
the desire to provide the user with a consistent mem- 
ory limit, rather than one that might change based on 
machine or machine usage. 


Data Summarization 


Other tools are intended to assist in traffic analy- 
sis by providing summarizations appropriate for 
graphing, and various statistical reports. For example, 
rwcount provides the number of bytes, packets and 
flows seen in the packed data provided, broken into 
user-specified time intervals (e.g., five minutes, one 
hour). This allows a user to glance at a report and 
determine if there was any sudden spike in activity. 
The tool rwtotal allows even finer granularity based on 
user-specified criteria. For example, a user can per- 
form an rwfilter to extract all traffic going to a particu- 
lar /24 address space, then pipe this to rwtotal and 
group the results (number of bytes, packets and flows) 
by the last octet of the destination address. This pro- 
vides a count of all the traffic going to each IP address 
in a /24 network. Other than specifying various parts 
of the source or destination address (e.g., last 8 bits, 
last 16 bits), the user can also print results based on 
protocol, source port, destination port, number of 
packets or number of bytes. 
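
The /24 example above amounts to a group-by on the last octet of the destination address. The following Python sketch shows that grouping on an assumed record shape of (destination IP, bytes, packets) tuples; it is an illustration, not the rwtotal implementation.

```python
from collections import defaultdict


def totals_by_last_octet(flows):
    """Sum bytes, packets and flows per last octet of the destination address.

    `flows` is an iterable of (dest_ip, bytes, packets) tuples; this record
    shape is a stand-in for illustration, not the SiLK format.
    """
    totals = defaultdict(lambda: [0, 0, 0])  # bytes, packets, flows
    for dest_ip, nbytes, npackets in flows:
        octet = int(dest_ip.rsplit(".", 1)[1])
        entry = totals[octet]
        entry[0] += nbytes
        entry[1] += npackets
        entry[2] += 1
    return {octet: tuple(entry) for octet, entry in totals.items()}
```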


The tool rwaddrcount is similar to rwtotal, however 
it is based on IP addresses instead of time intervals. It 
can take as input the results from an rwfilter query, and 
will return the total number of bytes, packets and 
flows for each source IP address, along with the time 
stamps for the first and last flows observed. The infor- 
mation provided can be further refined through com- 
mand line options specifying the minimum and maxi- 
mum flows, packets or bytes observed. 

The tool rwstats computes a variety of statistics,
based on the traffic flows provided to it. The number
of flows, and the percent of the input that these flows
represent, are provided for the groups specified by the
user. Top N lists (e.g., sort the results by the number of
flow records, and then only return the first N, such as
10, from the list, thereby presenting the results with
the most traffic) can be provided by either source or
destination IP address, where the user specifies the
value for N. In this manner, the user can view the top
10 (for example) sources that generated the most
flows to the monitored network, or the top 20
destinations that generated the most flows to outside
addresses. This feature can also be used based on
ports, rather than IP addresses. Additionally, the
bottom N (those groups with the fewest flows)
can also be specified. If preferred, the user can look at
combinations of items, such as the top
source-destination IP address combinations or
source-destination port pairs. Additional statistics, such
as the minimum, maximum, quartile, and interval statistics
for bytes, packets and bytes/packet can be determined
based on protocol (e.g., TCP, UDP).


In addition to the 13 tools provided by the SiLK 
Suite, there are also seven utilities. The utilities differ 
from the tools as they were provided to assist in some 
analysis tasks, based on user feedback. In contrast, the 
tools were designed to perform the actual analysis. 
The utilities are: 

1. num2dot: This utility converts IP addresses from 
a 32-bit integer to dotted quad notation. It 
expects output from rwcut (where the rwcut IP 
format had been specified to be 32-bit integers, 
rather than the default of dotted-quad), with the 
fields to be converted specified on the com- 
mand line. This tool is useful if output that con- 
tains IP information in both 32-bit integer and 
dotted-quad notation is desired. One example 
of where this would be useful is if the resulting 
flows needed to be imported into a spreadsheet. 
Using rwcut with --field=1,9,1-8 --integer-ips gen- 
erates rwcut output with IP numbers as 32-bit 
values. num2dot can then convert fields three 
and four to dotted-quad notation, leaving the 
first field (source IP address) as a 32-bit inte- 
ger, allowing easy sorting in the spreadsheet, 
yet still providing the dotted-quad value for the 
user (the third field would now contain the dot- 
ted-quad version of the source IP, while the 
fourth field contained the destination IP in dot- 
ted-quad). 

2. rwappend: This utility appends new flow 
records to an existing packed file. 

3. rwcat: This utility will concatenate packed files
into a single stream.

4. rwfileinfo: This utility reads the header
information of a packed file and prints it to the screen.
This information includes items such as the
number of records in the file and the command
line that generated the file.

5. rwfglob: This utility can be used to determine
what files will be processed given a set of fglob
options (e.g., start date, incoming or outgoing,
etc.).

6. mapsid: This utility determines the sensor name 
or number, and can convert between the two 
representations. 

7. rwswapbytes: This utility can be used to change 
the endianness of a packed file. 
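
The conversion performed by num2dot (utility 1 above) can be sketched in Python with the standard `ipaddress` module. The function name and the assumption that columns are separated by a `|` character are illustrative; they are not taken from the tool's source.

```python
import ipaddress


def num2dot_fields(line, fields, delimiter="|"):
    """Convert the 1-based `fields` of a delimited line from integers to dotted quad."""
    columns = line.split(delimiter)
    for index in fields:
        value = int(columns[index - 1].strip())
        columns[index - 1] = str(ipaddress.IPv4Address(value))
    return delimiter.join(columns)
```

Converting only fields three and four, as in the spreadsheet example, leaves the first column as a sortable integer while the later columns become human-readable.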


Security Analysis 


These tools can easily be scripted to deliver regu- 
lar reports. For example, one of the client sites using 
these tools produces top 10 lists on a nightly basis.
The top 10 IP addresses that have seen the most flows, 
or bytes, or packets can easily be determined through a 
combination of rwfilter and rwaddrcount. Similar statis- 
tics can easily be generated for ports based on rwtotal. 


The following sections demonstrate some of the 
capabilities of the tool set to detect various types of 
activity. All IP addresses have been obfuscated. All 
internal addresses are represented as 10.x.x.x, while 
all external addresses are represented as 241.x.x.x. 
Access to the data may be made available by special 
request to the authors. 


Scanning Activity 


Adversaries often perform a scan of a network as 
the prelude to an attack. In particular, “script kiddies” 
(unskilled attackers) will often deploy an exploit 
across all of the machines in a network [7]. This type 
of activity will appear as a SYN scan (in the case of a 
TCP-based exploit), where there might be some fur- 
ther communication with internal systems that respond 
to the scanner with a SYN-ACK. The SiLK tool set 
can be used to find scanners of this type, and to deter- 
mine if particular machines should be investigated for 
compromise. 


For example, to look for “fast” scanners (that is,
scanners who have contacted a large number of
machines in a short amount of time), we can do the
following:

rwfilter --start=2004/6/29:17 \
--syn=1 --ack=0 --fin=0 \
--proto=6 --pass=stdout | \
rwaddrcount --print-recs \
--rec-min=65000


The rwfilter command here uses incoming traffic
(both web and non-web) by default. It processes one
hour of data (17:00-18:00 GMT on June 29, 2004),
looking for all flows where the SYN flag was set, and
the ACK and FIN flags were not set. The other flags
(RST, URG and PSH) can take any value. Only the
TCP protocol is used (--proto=6, using the standard
protocol numbers as defined by the Internet Assigned
Numbers Authority (IANA) [4]). The results from the
rwfilter command are passed through stdout to the
rwaddrcount command. This command prints all the source
IPs that had more than 65000 flows, along with the
number of bytes, packets and flows, with start and end
times (see Figure 1). There were two sources that met
these criteria.
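
The logic of this pipeline (keep SYN-only TCP flows, then count flows per source and apply a minimum) can be sketched in Python. The dictionary-based flow layout and the function name are assumptions for illustration, and the minimum is treated as inclusive here.

```python
from collections import Counter


def fast_scanners(flows, min_records):
    """Return sources with at least `min_records` SYN-only TCP flows.

    Each flow is a dict with 'sip', 'proto' and 'flags' keys; this layout is
    assumed for illustration. 'flags' is a set of TCP flag names.
    """
    counts = Counter()
    for flow in flows:
        flags = flow["flags"]
        # SYN set, ACK and FIN clear; RST, URG and PSH may take any value.
        syn_only = "SYN" in flags and "ACK" not in flags and "FIN" not in flags
        if flow["proto"] == 6 and syn_only:
            counts[flow["sip"]] += 1
    return {sip: n for sip, n in counts.items() if n >= min_records}
```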


The source IP that had the most records (and who 
therefore presumably scanned the most targets) was 
241.27.240.226, with 74,773 flows (a little over one /16 
network, if each flow is to a different destination IP), 
while 241.21.21.24 had 65,732 flows. We therefore elect 
to examine the traffic from both source IPs in detail. 
In particular, we would like to know how many unique
destination IP addresses each source targeted, and
what ports they targeted.


To answer the first question regarding the num- 
ber of destinations, we can extract all of the flows for 
each source via an rwfilter call. In this case we will save 
the results to disk so that we do not need to continu- 
ally process an entire hour of data. We also drop the 
restriction on the flag combinations so that we see all 
of the TCP flows from this source. The command used 
for the first source IP address is: 

rwfilter --start=2004/6/29:17 \ 
--saddr=241.27.240.226 \ 
--proto=6 \ 
--pass=rwdatafile 
The process is the same for the second IP address. The
resulting file contains 75,199 records (obtained by
using rwfileinfo, or by adding the option --print-stat to the
rwfilter command). To determine the number of unique
destination IP addresses in this file, we can use the
command:
rwuniq --field=2 --no-title \
rwdatafile | wc
which performs the equivalent of sort | uniq -c using the
destination IP field (field = 2), with the titles for the
fields turned off. The result was 75,199 lines, which
indicates that there were that many unique IP
addresses, and therefore only one flow per destination IP
address. However, given that there were more records
observed when the flag restriction was dropped, there
was likely some further communication with some
of the additional destination IP addresses.


To determine the ports that were targeted, we can 
perform the same query as above, but replace the field 
value with 4 (for destination port). The result from this 
query (rwuniq --field=4 rwdatafile) is:


dPort count 
80 75199 


This shows that all of the flows were to destination
port 80 (web). Similarly, running the same
commands for source IP address 241.21.21.24 also
showed a scan of port 80.


    IP Address|  Bytes|Packets|Records|         Start Time|           End Time
  241.21.21.24|3855940|  87635|  65732|06/29/2004 17:00:00|06/29/2004 17:48:13
241.27.240.226|5267496| 109792|  74773|06/29/2004 17:00:02|06/29/2004 17:49:40

Figure 1: The results from rwaddrcount.


We are interested in determining if there were 
any responses to these two scans. To determine this, 
we first create an ipset from these two IPs. We can do 
this by creating a file that contains the two IP 
addresses and then running buildset. Alternatively, we 
can do the following: 

rwfilter --start=2004/6/29:17 \ 
--syn=1 --ack=0 --fin=0 \ 
--proto=6 --pass=stdout | \ 
rwaddrcount --print-ip \ 
--rec-min=65000 --no-title | \ 
buildset stdin ip.set 
With only two IP addresses, it is quicker to just create
a temporary file; however, if there had been a large
number of IP addresses, then the latter approach would be
preferable.


We can then use the ipset that we have created as 
a filter on the outgoing data to determine what com- 
munication there was from the internal network to 
these two scanning IP addresses. The command we 
would use is: 
rwfilter --type=out,outweb \ 
--start=2004/6/29:17 --proto=6 \ 
--dipset=ip.set --pass=rwdata.out 


The result was a file consisting of 17,542 records. This 
indicates a very large number of responses! However, 
we are only interested in positive responses, or records 
where there was no RST returned to the source. The 
rwdata.out file can be further filtered on the flag combi- 
nations, to examine only those flows that contained no 
RST using the command: 
rwfilter --rst=0 rwdata.out \ 
--pass=rwdata.out.noRST 


Unfortunately, this still resulted in 16,865 records. We
therefore take a quick look at the data to see if we can
determine anything interesting. We do this by first
sorting on the source IP address (the internal
responding host), followed by displaying the results:
rwsort --field=1 rwdata.out.noRST | \
rwcut --field=1,2,3,4,6,7,8 | \
less

A sample from the result set is given in Figure 2. This 

shows that the scanner was proceeding in order 

through the IP space. In this instance, the source had 


sIP dIP|sPort|dPort 
10..10.10..1) 241 37.150: 226 80} 1542 


10,10.10.2| 241 .37.150.226 80} 1543 
1G; 101053) 241 .37<150. 226 80} 1544 
LQ.10.10,4| 241 .3/.cd50..226 80} 1545 
10::70210.5) 241 37.,150.226 80} 1546 
L0:.10..10..6| 2415.37 150.226 80} 1547 
10:530..1037)| 241 <3 7.550;.226 80} 1548 
10.20.10. 8 241 34.1 90.226 80} 1549 
10510.10.9] 241.37.150..226 80} 1550 
10.10.10.10} 241.37.150.226 SO) 1531 
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actually stumbled onto a honey-pot, which is why 
there was a response from each IP address in that par- 
ticular subnet. In general, if an unusually high number 
of the same service is seen on the same subnet (e.g., 
16000 web servers on a/16) where the subnet is a gen- 
eral network (that is, not a server farm, for example), 
then it might indicate a honey pot or a firewall (as 
some firewalls can be configured to respond in this 
manner). This hypothesis is further supported by each 
server responding with exactly seven packets and 
1646 bytes, implying that they are returning the same 
content (or at least content that is exactly the same 
size!). In our case, it turns out that the majority of 
responses to the scan were due to this honeypot. 


Worm Attacks 


Recently, two prominent worms (Korgo [9] and 
Sasser [10]) have been released that scan port 445. 
When a vulnerable machine is found, each of the 
worms exploits the vulnerability, but then diverge to 
perform different activities on the infected machine. 
As we care less about external machines scanning our 
network for vulnerabilities than we do about internal 
machines that have been infected, we turn our attention 
to examining outgoing network traffic. We know that 
infected machines scan for vulnerabilities on port 445, 
sO We Can narrow our search by examining only flows 
with destination port 445. Since we are looking for 
machines that perform scanning of this port, by defini- 
tion there will be a large number of unique destination 
IP addresses contacted by a single source. We there- 
fore want to find all internal machines that are contact- 
ing large numbers of external machines on port 445. 
(Note that just a large number of flows is not necessar- 
ily indicative of an infection, but that a large number 
of unique destination IP addresses is more indicative.) 


To extract the information we want, we first per- 
form an rwfilter, and then pipe these results through a 
call to rwstats: 


rwfilter --type=out \ 
--start=2004/6/29:17 \ 
--proto=6 --dport=445 \ 
--pass=stdout | \ 
rwstats --pair-topn=10 


AS we are now examining outgoing traffic, we 
need to specify that the type is out instead of the 
default of incoming. We do not need to examine 


packets bytes flags 
7 1646| FS PA 
7 1646| FS PA 
7 1646| FS PA 
7 1646| FS PA 
7 1646| FS PA 
7 1646| FS PA 
7 1646| FS PA 
7 1646| FS PA 
7 1646| FS PA 
6 1152] FS PA 


Figure 2: Output from filtering on a particular destination IP. 
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This shows that four IP addresses contacted more 
than 100 unique destinations in a single hour, which is 
an unusually high number of destinations. (It has been 
observed by Williamson that workstations usually 
contact no more than ten destination IP addresses per 
hour [13].) These four machines therefore warrant 
additional investigation as they might be infected with 
Sasser or Korgo (or some other worm or virus). The 
only IP address that shows up in both top ten lists — 
that of number of connections to unique destination IP 
addresses and that of number of flows between it and 
some other source — is 10.150.100.100. 


SYN Flooding 


Another security concern is denial of service 
attacks. One of the common network-based denial of 
service attacks is SYN flooding. We can use commands 
similar to those used to detect worms to detect if a 
SYN flood has occurred. In this case, we want to detect 
all source-destination IP pairs that have seen an exces- 
sive number of SYN packets. To do this, we first filter 
on all incoming traffic for flows with the SYN bit set, 
but with no ACK or FIN, examining only the TCP pro- 
tocol. We then run rwstats on the result, looking for the 

: = source-destination pairs that have the most flows. In 
-pass=stdout | \ f : 
rwstats --pair-top-threshold=1 \ act, we can specify that at least some X number of 
gawk -F"|" *{print $1}’ | sort | \ flows are required before we consider this a SYN flood 
uniq -c | sort -nr | head that we want to investigate. In this case, we choose 
X = 1000, resulting in the following command: 


outweb, as port 445 is not one of the web ports. Again 
we look at only one hour of data, extracting all traffic 
to destination port 445 using the TCP protocol. The 
output from this command is piped into rwstats, which 
produces the top ten source-destination IP pairs based 
on the number of records. The output from this com- 
mand is give in Figure 3. 


This is not exactly the output that we want, since 
we want the sources that have contacted the most desti- 
nations, not the source-destination pairs that have the 
most flow records. To get this information, we can 
specify a threshold on the number of flows that a 
source-destination pair must have before printing it to 
the screen. By specifying a threshold of one, we extract 
all pairs. However, this still only provides a list of all 
pairs, along with information about each pair such as 
the number of flow records. We can take this informa- 
tion and pipe it through some standard unix utilities to 
extract, for example, the ten sources who contacted the 
most destinations. The command to do this is: 


rwfilter --type=out \ 
--start=2004/6/29:17 \ 
--proto=6 --dport=445 \ 


The results from this command are: 


78443 10.101.100.10 
16083 10.123.100.100 
940 10.150.100.100 
127 10.0152-100,. 100 
92: 10.20,,10.100 
43° 110.,20:..1..20 
12-10, 1171.20.50 
9 10.30.100.40 
G6 10 547-7).30:.30 
5 10,199.100. 60 


INPUT SIZE: 127393 records 
SOURCE IP/DEST IP PAIRS: Top 10 of 95825 unique 


rwfilter --syn=1 --ack=0 \ 
--fin=0 \ 
--start=2004/6/29:17 \ 
--pass=stdout \ 
--proto=6 | \ 
rwstats --pair-top-threshold=1000 
This produces the result shown in Figure 4. 


This example shows that there was one SYN 
flood that occurred during the hour that was examined. 
We can look at the flows in detail by using the command: 


src_ip_addr| dest_ip_addr| num_pairs| %_of_input| cumul_%
L0..1400.. 140| 241.21.208.42| 99| 0.077712%| 0.077712%
10.10: 10E0| 241 22 OF. 159| 52| 0.040819%| 0.118531%
10.110.100.10| 241.240.17.204| 22| 0.017269%| 0.135800%
10::120..100 «20| 241.241.17.204| 21| 0.016484%| 0.152285%
10.10% dad| 241.242.1.51| 14| 0.010990%| 0.163274%
10.130.100.100| 241.243.200.199| 10| 0.007850%| 0.171124%
10.10.1.10| 241.244.187.97| 10| 0.007850%| 0.178974%
10.140.10.100| 241.245.231.202| 8| 0.006280%| 0.185254%
10.150.100.100| 241.23.240.114| 5| 0.003925%| 0.189178%
10.150.100.100| 241.24.128.179| 5| 0.003925%| 0.193103%
Figure 3: Output from rwstats --pair-topn=10. 
INPUT SIZE: 4477703 records 
SOURCE IP/DEST IP PAIRS: Top 30 of 3953344 unique 
src_ip_addr| dest_ip_addr| num_pairs| %_of_input| cumul_%
241.240.220.58| 10.100.100.100| 20893| 0.466601%
Figure 4: Output from rwstats --pair-top-threshold=1000.

rwfilter --saddr=241.240.220.58 \
--daddr=10.100.100.100 \
--start=2004/6/29:17 \
--pass=stdout | \
rwsort --field=9 | \
rwcut --field=3-8 | less


This command filters on the particular source and desti-
nation IP addresses of interest for the one hour, then
sorts the records by flow start time. A sample of the
results from this command is:


sPort|dPort|pro|packets|bytes|flags
54237|17299|  6|      1|   60|S
54232|38318|  6|      1|   60|S
54235|62020|  6|      1|   60|S
54238|46925|  6|      1|   60|S
54239|23970|  6|      1|   60|S
54240| 3568|  6|      1|   60|S
54233|43740|  6|      1|   60|S
54228|14472|  6|      1|   60|S
54241|17440|  6|      1|   60|S


This is an unusual set of traffic in that it appears 
that the attacker was flooding a particular machine, 
rather than a specific service. It is also unusual for the 
TCP SYN packet to contain 60 bytes. Further, it 
appears that the DoS was directed against only high- 
numbered ports. 


To determine if there was any variation in the 
protocol, packets, bytes or flags, we run: 
rwfilter --saddr=241.240.220.58 \ 
--daddr=10.100.100.100 \ 
--start=2004/6/29:17 \ 
--pass=stdout | \ 
rwuniq --field=6 
In this case, we are looking at how many different num- 
bers of packets (field=6) appear in the set of flows. By 
varying the field value, we can also examine bytes, 
flags, and protocol. In this case we found that all of the
flows were 1-packet TCP SYN flows consisting of 60
bytes. By choosing field=4 for destination port, and 
then piping the result through sort and wc, we found that 


Date                   Records          Bytes
01/26/2004 07:40:00       5.00        4508.00
01/26/2004 07:50:00       5.00        3468.00
01/26/2004 08:00:00       6.00    47078833.00
01/26/2004 08:10:00       9.00       93213.00
01/27/2004 20:40:00       9.00        6152.00
01/27/2004 20:50:00    9240.00      786257.00
01/27/2004 21:00:00    1010.00       90580.00
01/27/2004 21:10:00   34569.00     2788388.00
01/27/2004 21:20:00   28810.00     2326538.00
01/27/2004 21:30:00    9039.00      735054.00
01/27/2004 21:40:00       7.00       15842.00




13915 unique ports were targeted, with no port being 
hit more than three times. 


Infected Machines 


Another example usage comes from tracking the 
MyDoom worm in late January, 2004. This worm 
spread via an email attachment that created a backdoor 
on ports 3127-3198. After the release of this worm, 
scanning for this backdoor increased significantly. To 
see the number of flows caused by this scanning in 
10-minute intervals (indicated by --bin-size=600, for 
600 seconds) over 26-27 January 2004, we use the
commands: 


rwfilter --start-date=2004/1/26:00 \
--end-date=2004/1/27:23 \
--dport=3127 --proto=6 \
--type=in --pass=stdout | \
rwcount --bin-size=600


Note that we use only the incoming non-web 
data, and not the web data. This is because port 3127 
can be chosen as an ephemeral port for web connec- 
tions, which is benign traffic that we want to exclude. 
The rwfilter command processed 354,559,695 records, 
generating output that consisted of only 104,376 
records to be processed by rwcount. In this case, we 
use the default binning of rwcount, which is to put the 
flow in the bin based on its start time, regardless of the 
elapsed time of the flow. For example, if a flow con- 
sisted of 10,000 bytes over 20 minutes, then all 10,000 
bytes would be counted in the first 10-minute bin 
(based on start time), rather than 5000 bytes in the first 
10-minute bin, and 5000 bytes in the second 
10-minute bin. One of the options provided with
rwcount will split the bytes and packets evenly over the
time period covered by the record. This could result in
fractional values (and hence we provide two digits after
the decimal place for precision in the output). A snap-
shot of some of the results is provided in Figure 5.
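
The difference between the two binning strategies can be illustrated with a short Python sketch. It is not the rwcount implementation: the function names are hypothetical, flows are assumed to be (start_seconds, duration_seconds, bytes) tuples, and the "split" variant is assumed to weight each bin by the share of the flow's duration that falls inside it.

```python
from collections import defaultdict

def bin_by_start(flows, bin_size=600):
    """Credit each flow's bytes entirely to the bin containing its start time."""
    bins = defaultdict(float)
    for start, duration, nbytes in flows:
        bins[(start // bin_size) * bin_size] += nbytes
    return dict(bins)

def bin_split_evenly(flows, bin_size=600):
    """Spread each flow's bytes over the bins it overlaps, in proportion to time."""
    bins = defaultdict(float)
    for start, duration, nbytes in flows:
        if duration <= 0:
            # Zero-length flow: nothing to split, use the start bin.
            bins[(start // bin_size) * bin_size] += nbytes
            continue
        end = start + duration
        t = start
        while t < end:
            bin_start = (t // bin_size) * bin_size
            segment_end = min(end, bin_start + bin_size)
            bins[bin_start] += nbytes * (segment_end - t) / duration
            t = segment_end
    return dict(bins)

# The example from the text: 10,000 bytes over 20 minutes, 10-minute bins.
flow = [(0, 1200, 10000)]
by_start = bin_by_start(flow)
split = bin_split_evenly(flow)
```

With the example flow, the first function places all 10,000 bytes in the first bin, while the second places 5000 bytes in each of the two bins, which is where fractional values can arise for less convenient durations.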


Two interesting events occur in this data. The first 
is a sudden jump in the number of bytes transferred, 


Packets
21.00
63.00
36509.00
123.00
63.00
14840.00
1683.00
53526.00
44585.00
14112.00
101.00


Figure 5: Output from filtering on destination port 3127 and then looking at the number of bytes, packets and flows
in 10 minute intervals.


rwfilter --stime=2004/1/26:08:00:00-2004/1/26:08:10:00 dport.3127 \
--pass=stdout | rwcut


Figure 6: Filtering on ten minute interval.

even though the number of flows remained constant.
Drilling down to investigate further, we filter on the 10
minute interval and then print the resulting records; see
Figure 6. One flow in this time period accounted for the
majority of bytes: a transfer from port 119, which is used
by the network news protocol but also by the Happy99
trojan [12]. However, by looking up the source IP address,
we find that it is a news server, and so this traffic is
likely legitimate.


The second interesting event is the sudden jump 
in the number of records, which likely represents scan- 
ning activity against our network. We can determine 
which source IP addresses had the most flows associ- 
ated with them by using the command in Figure 7. 
This command prints all IP addresses that had more 
than 10 flows in the one hour time period. There were 
only two IP addresses that met this criterion, one of 
which had 12 flows, and the second of which had 
82,639. It is therefore likely that this second IP was 
performing a scan of our network. 


It is interesting to determine if there was any 
traffic that was returned to the scanning IP address. To 
do this, we filter all outgoing traffic on the scanning IP 
address as a destination (here, we represent this IP as 
241.2.3.4); see Figure 8. If there had been multiple 
scanning IP addresses, we could perform the same 
operation by creating an ipset first and then filtering 
on this set. We now have a file that contains all of the 
return traffic to the (potential) scanner(s). 


Examining this file more closely, we find 2658
flows. We are particularly interested in flows that do
not consist of only a RST-ACK. To determine whether any
flows meet this criterion, we can filter on all flows that
contain just a RST-ACK, and then look at those flows
that fail this filter:


rwfilter --rst=1 --ack=1 \
--urg=0 --psh=0 \
--syn=0 --fin=0 \
--proto=6 response.scanners \
--fail=stdout | \
rwcut --fields=1-8 | less


There are only 10 records that fail this query. Fortu-
nately, all 10 records were ICMP error messages, and
so we can conclude that no internal machines had the
trojan running.
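
The pass/fail logic of this query can be sketched as follows. This is a minimal illustration in Python, not the rwfilter implementation; the record layout (a dict with a protocol number and a string of flag letters) and the function names are assumptions made for the example.

```python
def is_rst_ack_only(flow):
    """True for a TCP flow whose flags are exactly RST and ACK."""
    return flow["proto"] == 6 and set(flow["flags"]) == {"R", "A"}

def partition(flows, predicate):
    """Split flows into those that pass the predicate and those that fail it."""
    passed, failed = [], []
    for flow in flows:
        (passed if predicate(flow) else failed).append(flow)
    return passed, failed

# Hypothetical return traffic to a scanner.
responses = [
    {"proto": 6, "flags": "RA"},  # port closed: the uninteresting case
    {"proto": 6, "flags": "SA"},  # a SYN-ACK would suggest a listening service
    {"proto": 1, "flags": ""},    # ICMP error message
]
passed, failed = partition(responses, is_rst_ack_only)
```

Only the failing records need to be inspected by hand, which is why the query is phrased in terms of what fails rather than what passes.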


Comparison to Related Work 


The work most closely related to ours is OSU
FlowTools, developed by Fullmer and Romig [3]. OSU
FlowTools is a great toolkit, and we initially
investigated using it. However, it could not process
the amount of data we had in the time required, nor
did it compress information sufficiently to minimize
disk space. While FlowTools has continued to be
developed over the past two years (the latest release
appears to be December 2003), increasing the
efficiency of flow processing or disk storage has not
been a priority. Indeed, for the majority of networks,
OSU FlowTools
is more than sufficient. However, our needs corre- 
spond to providing analysis tools for a large ISP, 
where long-term trending as well as short-term secu- 
rity analysis were requirements. We therefore devel- 
oped our own flow packing system with performance 
and disk space minimization as design goals. Maintaining
information on 1.5 billion flows requires approximately
30 GB of disk space. Additional space savings
can be obtained through compression. (Saving this 
information as raw NetFlow records requires approxi- 
mately 67 GB.) These flows can be processed (via 
rwfilter) in 21 minutes on a Sun 4800. 


Many of the basic tools we provide are the same 
as in the OSU FlowTools, such as the ability to filter 
flows on ports or addresses, or to perform some level 
of statistical analysis. However, OSU FlowTools relies 
on Unix utilities for items such as sorting and 
uniq’ing, while we have developed utilities that per- 
form these operations on the raw data. By using these 
customized utilities, the performance increases signifi- 
cantly. For example, we can sort 45,433,086 records in 
five minutes, instead of 11.5 minutes required to sort 
the ASCII output. 


One of the capabilities that we do provide, that 
appears to be missing in OSU FlowTools, is that of 
ipsets. This provides a user with the ability to generate 
any arbitrary list of IP addresses (such as a list of 
known scanners, or known hostile hosts, or key inter- 
nal servers) and use this list in an efficient manner as a 
filter option. This functionality has proven to be par- 
ticularly useful for security analysis. For example, ear- 
lier we showed how to use ipsets to store a list of 
scanning IP addresses, which we can then use to filter 
outgoing data to search for SYN-ACK responses to 
these scans, which might indicate potential compro- 
mises. Assume that there were 1000 scanners in whom 
we were interested (rather than just the two in the 
example). OSU FlowTools would require the user to 
create an acl file with the IP addresses of interest in it 
in order to achieve the desired filtering. In contrast, we 
can generate the ipset of interest from the first rwfilter, 
and use this to then filter the outgoing data. Again, our 


rwfilter --stime=2004/1/27:20:40:00-2004/1/27:21:40:00 dport.3127 \
--pass=stdout | rwaddrcount --print-rec --rec-min=10


Figure 7: Finding IP addresses with most flows. 


rwfilter --start-date=2004/1/27:20 --end-date=2004/1/27:21 \
--class=out --type=in --daddr=241.2.3.4 --pass=response.scanners


Figure 8: Filter by scanning IP address.

approach has been optimized for performance (using a
tree rather than a linear list), so that there is no reduc- 
tion in filtering speed as the number of IPs in the ipset 
increases. Another example would be a “bad list,” 
containing the IP addresses of external hosts who are 
known to have exhibited malicious activity in the past. 
The bad list can be represented as an ipset, and then 
the incoming data can be filtered on the bad list IPs as 
sources. Similarly the outgoing data can be filtered 
with the bad list as destinations. If we receive a bad 
list from another site that we wish to merge with our 
own, we need only do an rwset-union to combine the 
two sets into one. 
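
The bad-list workflow can be sketched with ordinary Python sets. Note that this swaps in a hash-based set for the tree structure described above, so it illustrates only the set semantics (membership and union), not the ipset file format or its performance; the function names and addresses are hypothetical.

```python
import ipaddress

def make_ipset(addresses):
    """Build a set of IPv4 addresses stored as integers for fast lookup."""
    return {int(ipaddress.IPv4Address(a)) for a in addresses}

def filter_by_source(flows, ipset):
    """Keep flows whose source address is a member of the set."""
    return [f for f in flows if int(ipaddress.IPv4Address(f["src"])) in ipset]

our_bad_list = make_ipset(["241.2.3.4", "241.9.9.9"])
their_bad_list = make_ipset(["241.9.9.9", "241.7.7.7"])
# Merging a bad list received from another site is a plain set union.
merged = our_bad_list | their_bad_list

# Hypothetical incoming flows, filtered with the bad list as sources.
incoming = [
    {"src": "241.7.7.7", "dst": "10.8.8.8"},
    {"src": "241.1.1.1", "dst": "10.8.8.8"},
]
hits = filter_by_source(incoming, merged)
```

Filtering outgoing data with the bad list as destinations would be the same function with the "dst" field tested instead.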


In addition, we provide the ability to extend the 
filtering capabilities of rwfilter through the use of 
dynamic libraries. Using this approach, administrators 
can program their own queries for cases where their 
query is too complex for the current filtering options. 
One example of where a dynamic library is useful is in 
examining flow traffic for particular patterns of activ- 
ity. For example, one sign of a successful buffer over- 
flow is that a source first contacted a server S on port 
P, and that this was then followed with a subsequent 
communication from the source to server S but on port 
R (e.g., the first flow represents 241.9.9.9 > 10.8.8.8:80, 
and the second flow is 241.9.9.9 > 10.8.8.8:5483). 
Every time a flow showed a connection with more 
than one 40-byte packet to port 80 on some destina- 
tion, then the information could be stored in a hash ta- 
ble with the source and destination IPs as the key. This 
hash table would be checked each time a flow was 
encountered that did not meet the above condition. If 
such a match was found, then the entry in the hash ta- 
ble would be marked. Once all records were pro- 
cessed, all marked entries in the hash table could be 
printed. To the best of our knowledge, none of the 
other flow tools provide this capability. 
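
The hash-table procedure described above can be sketched in Python. The real extension mechanism is a dynamic library loaded by rwfilter, so this is only an illustration of the logic; the record fields and function name are hypothetical, and the trigger condition is assumed to mean "more than one packet to the first port", which is one reading of the description.

```python
def find_followup_contacts(flows, first_port=80):
    """Return (src, dst) pairs with a flow to first_port followed by one to another port.

    Each flow is a dict with keys src, dst, dport and packets, given in
    time order.
    """
    table = {}  # (src, dst) -> True once a follow-up flow has been seen
    for flow in flows:
        key = (flow["src"], flow["dst"])
        # Assumed trigger: more than one packet sent to the first port.
        if flow["dport"] == first_port and flow["packets"] > 1:
            table.setdefault(key, False)
        elif key in table and flow["dport"] != first_port:
            table[key] = True  # same pair seen again on a different port
    # Once all records are processed, report the marked entries.
    return [key for key, marked in table.items() if marked]

# Hypothetical flows: only the first source contacts the server twice.
flows = [
    {"src": "241.9.9.9", "dst": "10.8.8.8", "dport": 80, "packets": 5},
    {"src": "241.9.9.9", "dst": "10.8.8.8", "dport": 5483, "packets": 3},
    {"src": "241.5.5.5", "dst": "10.8.8.8", "dport": 80, "packets": 4},
    {"src": "241.6.6.6", "dst": "10.8.8.8", "dport": 5483, "packets": 2},
]
suspects = find_followup_contacts(flows)
```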


Another useful capability that is provided by the 
SiLK Suite is rwfileinfo, which allows a user to deter- 
mine information about a packed file. What is particu- 
larly useful about this command is that it will return 
the arguments that were provided to rwfilter in order to 
generate the data file. 


Conclusions and Future Work 


We have presented a new suite of tools for sav- 
ing and analyzing NetFlow data. The tools provided 
were built with network security analysis in mind, and 
can be easily extended by a knowledgeable C pro- 
grammer through both the creation of new tools and 
the incorporation of dynamic libraries. The tools were 
specifically designed for use on very large and very 
busy networks, and so had fast execution and minimal 
disk space usage as design requirements. 


We have completed the collection system and
provided basic analysis tools. We now intend to sup-
plement these capabilities by providing tools that
allow traffic descriptions. One example of this is bags,
which will be provided in an upcoming open source
release. Bags are similar to ipsets, except that rather
than using a single bit to indicate whether an IP address
has been seen, a bag provides a 32-bit counter that counts
the number of flows seen to/from each IP address. This
allows a user to ask questions such as “What IP
addresses saw only one flow in the past hour?” and
“How many IP addresses saw more than 10,000 flows
in the past day?” Tools such as these will allow an
administrator to characterize their network in cases
where they might not otherwise have the authority or
insight to do so (e.g., in the case of large
ISPs). Bags will be extended in a future release to be
even more generic, counting any type of “volume”
characteristic (e.g., flows, bytes, packets).
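
The idea of a bag can be sketched with a Python Counter keyed by address. This is an illustration only: the function names are hypothetical, the input is assumed to be (source, destination) pairs, and Python integers are unbounded rather than 32-bit counters.

```python
from collections import Counter

def build_bag(flows):
    """Count the flows seen to or from each IP address."""
    bag = Counter()
    for src, dst in flows:
        bag[src] += 1
        bag[dst] += 1
    return bag

def addresses_with_exactly(bag, n):
    """Which addresses saw exactly n flows?"""
    return sorted(ip for ip, count in bag.items() if count == n)

def count_addresses_above(bag, threshold):
    """How many addresses saw more than `threshold` flows?"""
    return sum(1 for count in bag.values() if count > threshold)

# Hypothetical flows.
flows = [
    ("10.1.1.1", "241.2.3.4"),
    ("10.1.1.2", "241.2.3.4"),
    ("10.1.1.1", "241.5.6.7"),
]
bag = build_bag(flows)
single_flow = addresses_with_exactly(bag, 1)
busy = count_addresses_above(bag, 1)
```

Counting bytes or packets instead of flows would only change the increment, which is the generalization to other "volume" characteristics mentioned above.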


In addition, we intend to provide the ability to
perform stateful queries. We are working
on an rwmatch tool, which will match flows from two
sets of data based on a specific attribute. For example,
we could filter all incoming flows to a particular port
(e.g., TCP 135) into one file, generating the ipset for
the sources at the same time. We could then use this
ipset to filter all outgoing traffic to an ephemeral port
(> 1024), and save the resulting data. rwmatch would
use the two output files, and match on the IP addresses
(destination in one direction matching with source in
the other direction). This would provide an aggregated
flow record indicating the traffic in both directions in a
single record. This would allow an administrator to see
all relevant data at once (e.g., the number of bytes and
packets in each direction), rather than
needing to compare two different data files by eye.
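
Since rwmatch is described here only as planned work, the following Python sketch is purely hypothetical: it shows one way the described matching (destination in one direction against source in the other) could produce a combined record. The field names and the output layout are assumptions, not the tool's format.

```python
def match_flows(incoming, outgoing):
    """Pair each incoming flow with outgoing flows between the same two hosts.

    A flow is a dict with src, dst, bytes and packets. An outgoing flow
    matches when its destination is the incoming source and its source is
    the incoming destination.
    """
    replies = {}
    for out in outgoing:
        replies.setdefault((out["dst"], out["src"]), []).append(out)
    merged = []
    for inc in incoming:
        for out in replies.get((inc["src"], inc["dst"]), []):
            merged.append({
                "outside": inc["src"], "inside": inc["dst"],
                "bytes_in": inc["bytes"], "packets_in": inc["packets"],
                "bytes_out": out["bytes"], "packets_out": out["packets"],
            })
    return merged

# Hypothetical data: one incoming flow and two outgoing flows.
incoming = [{"src": "241.2.3.4", "dst": "10.8.8.8", "bytes": 120, "packets": 2}]
outgoing = [
    {"src": "10.8.8.8", "dst": "241.2.3.4", "bytes": 4000, "packets": 6},
    {"src": "10.8.8.9", "dst": "241.2.3.4", "bytes": 60, "packets": 1},
]
records = match_flows(incoming, outgoing)
```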


The current tool suite has already been in opera- 
tional use at a large site for over a year and is currently 
used by several different organizations. Additionally, 
extensions have been coded that have been used in secu- 
rity publications. Two papers have been written that 
make use of this tool set, with some custom-coded 
extensions. McHugh [6] provides a good explanation of 
how to use the functionality of IP sets, along with the 
bags extension, for security analysis. Collins and Reiter
[2] have used the SiLK tool set in performing an analy-
sis of denial of service (DoS) traffic-filtering approaches.


ABSTRACT


Log analysis is an important way to keep track of computers and networks. The use of
automated analysis always results in false reports; however, these can be minimized by proper
specification of recognition criteria. Current analysis approaches fail to provide sufficient support
for recognizing the temporal component of log analysis. Temporal recognition of event
sequences falls into distinct patterns that can be used to reduce false alerts and improve the
efficiency of response to problems. This paper discusses these patterns while describing the
rationale behind and implementation of a ruleset created at the CS department of the University of
Massachusetts at Boston for SEC, the Simple Event Correlation program.


Introduction 


With today’s restricted IT budgets, we are all try- 
ing to do more with less. One of the more time con- 
suming, and therefore neglected, tasks is the monitoring 
of log files for problems. Failure to identify and resolve 
these problems quickly leads to downtime and loss of 
productivity. Log files can be verbose with errors hid- 
den among the various status events indicating normal 
operation. For a human, finding errors among the rou- 
tine events can be difficult, time consuming, boring, 
and very prone to error. This is exacerbated when
aggregating events, using a mechanism such as syslog,
due to the intermingling of events from different hosts
that can submerge patterns in the event streams.


Many monitoring solutions rely on summarizing
the previous day’s log files. This is very
useful for accounting and statistics gathering. Sadly, if 
the goal is problem determination and resolution then 
reviewing these events the day after they are generated 
is less helpful. Systems administrators cannot proac- 
tively resolve or quickly respond to problems unless 
they are aware that there is a problem. It is not useful 
to find out in the next morning’s summary that a pri- 
mary NFS server was reporting problems five minutes 
before it went down. The sysadmin staff needs to dis- 
cover these problems while there is still time to fix the 
problem and avert a catastrophic loss of service. 


The operation of computers and computer net-
works evolves over time and requires a solution to log
file analysis that addresses this temporal nature. This
paper describes some of the current issues in log anal-
ysis and presents the rationale behind an analysis rule
set developed at the Computer Science Department at
the University of Massachusetts at Boston. This rule-
set is implemented for the Simple Event Correlator
(SEC), which is a Perl-based tool designed to perform
analysis of plain text logs.


Current Approaches 


There are many programs that try to isolate error
events by automatically condensing or eliminating routine
log entries. In this paper, I do not consider interactive
analysis tools like MieLog [Takada02]. I separate
automatic analysis tools into offline or batch monitor-
ing and online or real-time monitoring.


Offline Monitoring 


Offline solutions include logwatch [logwatch],
SLAPS-2 [SLAPS-2], and Addamark LMS [Sah02].
Batch solutions have to be invoked on a regular basis 
to analyze logs. They can be run once a day, or many 
times an hour. Offline tools are useful for isolating 
events for further analysis by real time reporting tools. 
In addition they provide statistics that allow the sys- 
tem administrator to identify the highest event sources 
for remedial action. However, offline tools cannot
react automatically to problems as they occur.
Adam Sah, in discussing the Addamark LMS
[Sah02, p. 130] claims that real-time analysis is not 
required because a human being, with slow reaction 
times, is involved in solving the problem. I disagree 
with this claim. While it is true that initially a human 
is required to identify, isolate and solve the problem, 
once it has been identified, it is a candidate for being 
automatically addressed or solved. If a human would
simply restart Apache when a particular sequence of
events occurs, why not have the computer automati-
cally restart Apache instead? Automatic problem
responses coupled with administrative practices can 
provide a longer window before the impact of the 
problem is felt. An “out of disk space” condition can 
be addressed by removing buffer files placed in the 
file system for this purpose. This buys the system 
administrator a longer response window in which to 
locate the cause of the disk full condition minimizing 
the impact to the computing environment. 


Most offline tools do not provide explicit support 
for analyzing the log entries with respect to the time 
they were received. While they could be extended to 
try to parse timestamps from the log messages, this is 
difficult in general, especially with multiple log files 
and multiple machines, as ordering the events requires
normalizing the time for all log messages to the same
timezone. Performing analysis on log files that do not
have timestamps eliminates the ability of these batch
tools to perform analysis by time. Solutions such as
the Addamark LMS [Sah02] parse and record the
generation time, but the lack of real-time, event-driven
(as opposed to polled) triggers reduces their utility.


Online Monitoring 


Online solutions include: logsurfer [logsurfer], 
logsurfer+ [logsurfer+], swatch [swatch, Hansen1993], 
2swatch [2swatch], SHARP [Bing00], ruleCore 
[ruleCore], LoGS [LoGS] and SEC [SEC]. All of these 
programs run continuously watching one or more log 
files, or receiving input from some other program. 


Swatch is one of the better known tools. Swatch 
provides support for ignoring duplicate events and for 
changing rules based on the time of arrival. However, 
swatch’s configuration language does not provide the 
ability to relate arbitrary events in time. Also, it lacks
the ability to activate or deactivate rules based on the
existence of other events, beyond suppressing dupli-
cate events using its throttle action.


Logsurfer dynamically changes its rules based 
on events or time. This provides much of the flexibil-
ity needed to relate events. However, I found its syn-
tax difficult to use (similar to the earliest 1.x version
of SEC) and I never could get complex correlations 
across multiple applications to work properly. The 
dynamic nature of the rules made debugging difficult. 
I was never able to come up with a clean, understand- 
able, and reliable method of performing counting oper- 
ations without resorting to external programs. Using 
SEC, I have been able to perform all of the operations 
I implemented in logsurfer with much less confusion. 


LoGS is an analysis program written in Lisp that 
is still maturing. While other programs create their 
own configuration language, LoGS’s rules are also 
written in Lisp. This provides more flexibility in 
designing rules than SEC, but may require too much 
programming experience on the part of the rule 
designers. I believe this reduces the likelihood of its 
widespread deployment. However, it is an exciting 
addition to the tools for log analysis research. 


The Simple Event Correlator (SEC) by Risto 
Vaarandi uses static rules unlike logsurfer, but pro- 
vides higher level correlation operations such as 
explicit pair matching and counting operations. These 
correlation operations respond to a triggering event 
and persist for some amount of time until they time- 
out, or the conditions of the correlation are met. SEC 
also provides a mechanism for aggregating events and 
modifying rule application based on the responses to 
prior events. Although it does not have the dynamic 
rule creation of logsurfer, I have been able to easily
generate rules in SEC that provide the same function-
ality as my logsurfer rules.


Filter In vs. Filter Out


Should rules be defined to report just recognized 
errors, or should routine traffic be eliminated from the 
logs and the residue reported? There appear to be 
people who advocate using one strategy over the other. 


I claim that both approaches need to be used and 
in more or less equal parts. I am aware of systems that 
are monitored for only known problems. This seems 
risky as it is more likely an unknown problem will 
sneak up and bite the unwary systems administrator. 
However, very specific error recognition is needed 
when using automatic responses to ensure that the best 
solution is chosen. Why restart Apache if killing a 
stuck CGI program will solve the problem? 


Filtering out normal event traffic and reporting
the residue allows the system administrator to find sig-
natures of new, unexpected problems with the system.
Defining “normal traffic” in such a way that we can
be sure it is routine is tricky, especially if the filtering
program does not have support for the temporal com-
ponent of event analysis.


Event Modeling 


Modeling normal or abnormal events requires 
the ability to fully specify every aspect of the event. 
This includes recognizing the content of the event as 
well as its relationship to other events in time. With 
this ability, we can recognize a composite or corre- 
lated event that is synthesized from one or more primi- 
tive events. Normal activity is usually defined by these 
composite events. For example a normal activity may 
be expressed as: 


‘sendmail -q’ is run once an hour by root at 31 
minutes after the hour. It must take less than one 
minute to complete. 


> CMD: /usr/lib/sendmail -q 
> root 25453 c Sun May 23 03:31:00 2004 
< root 25453 c Sun May 23 03:31:00 2004


Figure 1: Events indicating normal activity for a 
scheduled cron job. 


This activity is shown by the cron log entries in Figure
1 and requires the following analysis operations:

• Find the sendmail CMD line, verifying that its
arrival time is around 31 minutes after the hour.
If the line does not come in, send a warning.
• The next line always indicates the user, process id
and start time. Make sure that this line indicates
that the command was run by root.
• The time between the CMD line arrival and the
final line must be less than one minute. Because
other events may occur between the start and end
entries for the job, we recognize the last line by
its use of the unique number from the second
field of the second line.
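
The three checks above can be sketched as a small Python function operating on already-parsed events. This is not SEC rule syntax and not the ruleset described in this paper; the function name, parameters, and tolerance value are hypothetical. It also runs after the fact, whereas a real-time tool would need a timer to notice that the CMD line never arrived.

```python
from datetime import datetime, timedelta

def check_sendmail_run(cmd_time, start_user, start_pid, end_events,
                       expected_minute=31, tolerance=1,
                       limit=timedelta(minutes=1)):
    """Return a list of problems found for one scheduled 'sendmail -q' run.

    cmd_time is when the CMD line arrived; start_user and start_pid come
    from the line after it; end_events is a list of (pid, arrival_time)
    pairs for job-end lines seen later.
    """
    problems = []
    if abs(cmd_time.minute - expected_minute) > tolerance:
        problems.append("ran at unexpected time")
    if start_user != "root":
        problems.append("not run by root")
    # The end line is recognized by the process id from the start line.
    ends = [t for pid, t in end_events if pid == start_pid]
    if not ends:
        problems.append("did not complete")
    elif ends[0] - cmd_time >= limit:
        problems.append("took too long")
    return problems

cmd_time = datetime(2004, 5, 23, 3, 31, 0)
ok = check_sendmail_run(cmd_time, "root", 25453,
                        [(25453, datetime(2004, 5, 23, 3, 31, 0))])
slow = check_sendmail_run(cmd_time, "root", 25453,
                          [(25453, datetime(2004, 5, 23, 3, 33, 0))])
unfinished = check_sendmail_run(cmd_time, "root", 25453,
                                [(111, datetime(2004, 5, 23, 3, 31, 5))])
```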


This simple example shows how a tool can ana-
lyze the event log in time. Tools that do not allow
the specification of events in the temporal realm as
well as in the textual/content space can suffer from the
following problems:

• matching the right event at the wrong time.
This could be caused by an inadvertent edit of
the cron file, or a clock skew on the source or
analyzing host.
• not noticing that the event took too long to run.
• not noticing that the event failed to complete at
all.


Temporal Relationships 


The cron example mentions one type of temporal
restriction that I call a schedule restriction. A schedule
restriction requires that events occur on a defined
(although potentially complex) schedule. Typical
schedule restrictions include: every hour at 31 minutes
past the hour, Tuesday morning at 10 a.m., every
weekday morning between 1 and 3 a.m.


In addition to schedule restrictions, event recogni- 
tion requires accounting for inter-event timing. The 
events may be from a single source such as the 
sequence of events generated by a system reboot. The 
statement that the first to last event in a boot sequence 
should complete in five minutes is an inter-event timing 
restriction. Also, events may arise from multiple 
sources. Multi-source inter-event timing restrictions 
might include multiple routers sending an SNMP 
authentication trap in five minutes, or excessive “con- 
nection denied” events spread across multiple hosts and 
multiple ports indicating a port scan of the network. 


These temporal relationships can be explicit
within a correlation rule, such as specifying a time window
for counting the number of events or suppressing events for
a specified time after an initial event. The timing rela-
tionship can also be implicit when one event triggers
the search for subsequent events.
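
An explicit time window for counting events can be sketched as a sliding-window counter. This is a generic illustration in Python, not how SEC implements its counting operations; the class name and the threshold and window values are hypothetical.

```python
from collections import deque

class WindowCounter:
    """Report when `threshold` events arrive within `window` seconds."""

    def __init__(self, threshold, window):
        self.threshold = threshold
        self.window = window
        self.times = deque()

    def add(self, timestamp):
        """Record an event; return True if the threshold is reached."""
        self.times.append(timestamp)
        # Drop events that have fallen out of the window.
        while self.times and timestamp - self.times[0] >= self.window:
            self.times.popleft()
        return len(self.times) >= self.threshold

# For example, three authentication traps within five minutes.
counter = WindowCounter(threshold=3, window=300)
results = [counter.add(t) for t in (0, 100, 400, 450, 500)]
```

The first two events age out before a third arrives, so only the final event, the third within 300 seconds, reaches the threshold.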


Event Threading 


Analysis of a single event often fails to provide a 
complete picture of the incident. In the example 
above, reporting only the final cron event is not as 
useful as reporting all three events when trying to 
diagnose a cause. Lack of proper grouping can lead to 
underestimating the severity of the events. Consider 
the following scenario: 

1. A user logs in using ssh from a location that 
s/he has never logged in from before. 

2. The ssh login was done using public key 
authentication. 

3. The ssh session tries to open port 512 on the 
server. It is denied. 

4. Somebody tries to run a program called 
“crackme” that tries to execute code on the 
stack. 

5. The user logs out. 


Looking at this sequence implies that somebody
broke in and made an unsuccessful attempt
to gain root privileges. However, in looking at individ-
ual events, it is easy to miss the connections. Taken in
isolation, each event could be easily dismissed, or
even filtered out of the reports. Reporting them as dis-
crete events, as many analysis tools do, may even con-
tribute to an increased chance of missing the pattern.
Taken together they indicate a problem that needs to 
be investigated. A log analysis tool needs to provide 
some way to link these disparate messages from dif- 
ferent programs into a single thread that paints a pic- 
ture of the complete incident. 


Missing Events 


Log analysis programs must be able to detect 
missing log events [Finke2002]. These missing events
are critical errors since they indicate a departure from
normal operation that can result in many problems.
For example, cron reports the daily log rotation at 
12:01 a.m. If this job is not done (say, because cron 
crashed), it is better to notice the failure immediately 
rather than three months later when the partition with 
the log files fills up. 


The problem with detecting missing events is that 
log monitoring is — by its nature — an event-driven 
operation: if there is no event, there is no operation. 
The log analysis tool should provide some mechanism 
for detecting a missing event. One of the simpler ways 
to handle this problem is to generate an event or action 
on a regular basis to look for a missing event. An 
event-driven mechanism can be created using external 
tools such as cron to synthesize events, but I fail to see 
a mechanism that the log analysis tool can use to detect 
the failure of the external tool to generate these events. 
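
The periodic check described above can be sketched as a comparison of expected arrival times against the events actually seen. This Python fragment is an illustration only; the function name, the grace period, and the use of plain second counts for timestamps are assumptions. It shares the weakness noted above: something must still invoke the check on schedule.

```python
def find_missing(expected_times, seen_times, grace=300):
    """Return the expected times with no matching event within `grace` seconds."""
    missing = []
    for expected in expected_times:
        if not any(0 <= seen - expected <= grace for seen in seen_times):
            missing.append(expected)
    return missing

DAY = 86400
# Log rotation is expected at 12:01 a.m. (60 seconds past midnight) on three days.
expected = [60 + day * DAY for day in range(3)]
# The second day's event never arrived.
seen = [62, 60 + 2 * DAY + 10]
missing = find_missing(expected, seen)
```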


Handling False Positives/False Negatives 


A false negative occurs when an event that indi- 
cates a problem is not reported. A false positive results 
when a benign event is reported as a problem. False 
negatives impact the computing environment by failing 
to detect a problem. False positives must be investi- 
gated and impact the person(s) maintaining the comput- 
ing environment. A false positive also has another dan- 
ger: It can lead to the “boy who cried wolf” syndrome, 
causing a true positive to be ignored as a false positive. 


Two scenarios for generating false negatives are 
mentioned above. Both are caused by incorrectly spec- 
ifying the conditions under which the events are con- 
sidered routine. 


False positives are another problem caused by 
insufficiently specifying the conditions under which 
the event is a problem. In either case, it may not be 
possible to fully specify the problem conditions 
because: 

• Not all of the conditions are known. 
• Some conditions cannot be monitored 
and cannot be added to the model. 


It may be possible to find correlative conditions 
that occur to provide a higher degree of discrimination 
in the model. These correlative events can be used to 
change the application of the rules that cause the false 
positive to inhibit the false report. 
To reduce these false positives and false negatives, 
the analysis program needs to have some way of 
generating and receiving these correlative events. 


While it is impossible to eliminate all false 
reports, by proper specification of event parameters, 
false reports can be greatly reduced. 


Single vs. Multiple Line Events 


Programs can spread their error reports across 
multiple lines in a logfile. Recognizing a problem in 
these circumstances requires the ability to scan not 
just a single line, but a series of lines as a single 
instance. The series of lines can be treated as individ- 
ual events, but key pieces of information needed to 
trigger a response or recognize an event sequence may 
occur on multiple lines. Consider the cron example of 
Figure 1: the first two lines provide the information 
needed to determine that it is an entry for sendmail 
started by root, and the process id is used in discover- 
ing the matching end event. Handling this multi-line 
event as multiple single line events complicates the 
rules for recognizing the events. 


Multi-line error messages seem to be more 
prevalent in application and device logs that do not 
use the Unix standard syslog reporting method, but 
some syslog versions split long syslog messages into 
multiple parts when they store them in the logfile. 
Fortunately, when I have seen this happen, the log lines 
always occur adjacent to one another without any 
intervening events from other sources. This allows 
recognition provided that the split does not occur in the 
middle of a field of interest. 


With syslog and other log aggregation tools, a 
single multi-line message can be distorted by the 
injection of messages from other sources. The logs 
from applications that produce multi-line messages 
should be directed to their own log file so that they are 
not distorted. Then a separate SEC process can ana- 
lyze the log file and create single line events that are 
passed to a parent SEC for global correlation. This is 
similar to the method used by Addamark [Sah02]. 


Although keeping the log streams separate simpli- 
fies some log analysis tasks, it prevents the recognition 
of conditions that affect multiple event streams. 
Although SEC provides a mechanism for identifying the 
source of an event, performing event recognition across 
streams requires that the event streams be merged. 


SEC Correlation Idioms and Strategies 


This section describes particular event scenarios 
that I have seen in my analysis of logs. It demonstrates 
idioms for SEC that model and recognize these scenarios. 


SEC Primer 

A basic knowledge of SEC’s configuration language 
is required to understand the rules presented 
below. There are nine basic rule types. I break them 
into two groups: basic and complex rules. Basic rule 
types perform actions and do not start an active 
correlation operation that persists in time. These basic 
types are described in the SEC man page as: 

• Suppress: suppress matching input event (used 
to keep the event from being matched by later 
rules). 

• Single: match input event and immediately 
execute an action that is specified by rule. 

• Calendar: execute an action at specific times 
using a cron like syntax. 


Complex rules start a multi-part operation that 
exists for some time after the initial event. The sim- 
plest example is a SingleWithSuppress rule. It triggers 
on an event and remains active for some time to sup- 
press further occurrences of the triggering event. A 
Pair rule recognizes a triggering event and initiates a 
search for a second (paired) event. It reduces two sep- 
arate but linked events to a single event pair. The com- 
plex types are described in the SEC man page as: 

• SingleWithScript: match input event and 
depending on the exit value of an external 
script, execute an action. 

• SingleWithSuppress: match input event and 
execute an action immediately, but ignore 
following matching events for the next T seconds. 

• Pair: match input event, execute the first action 
immediately, and ignore following matching 
events until some other input event arrives 
(within an optional time window T). On arrival 
of the second event execute the second action. 

• PairWithWindow: match input event and wait 
for T seconds for another input event to arrive. 
If that event is not observed within a given time 
window, execute the first action. If the event 
arrives on time, execute the second action. 

• SingleWithThreshold: count matching input 
events during T seconds and if given threshold 
is exceeded, execute an action and ignore all 
matching events during rest of the time window. 

• SingleWith2Thresholds: count matching input 
events during T1 seconds and if a given threshold 
is exceeded, execute an action. Now start to 
count matching events again and if their number 
per T2 seconds drops below second threshold, 
execute another action. 


SEC rules start with a type keyword and continue 
to the next type keyword. In the example rules below, 
‘...’ is used to take the place of keywords that are not 
needed for the example; it does not span rules. The 
order of the keywords is unimportant in a rule definition. 


SEC uses Perl regular expressions to parse and 
recognize events. Data is extracted from events by 
using subexpressions in the Perl regular expression. 
The extracted data is assigned to numeric variables $1, 
$2, ..., $N where N is the number of subexpressions 
in the Perl regular expression. The numeric variable 
$0 is the entire event. For example, applying the 
regular expression “([A-z]*): test number ([0-9]*)” to the 
event “HostOne: test number 34” will assign $1 the 
value “HostOne”, $2 the value “34”, and $0 will be 
assigned the entire event line. 
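
The variable assignment described above can be tried out with a few lines of Python, whose re module accepts the same pattern. This is only an illustrative sketch: the function name extract and the dictionary layout are my own choices, not part of SEC.

```python
import re

# Pattern taken from the text. Note that in ASCII the range A-z also
# covers the punctuation characters that sit between 'Z' and 'a'.
PATTERN = re.compile(r"([A-z]*): test number ([0-9]*)")


def extract(event):
    """Return SEC-style variables {0: event, 1: ..., 2: ...}, or None."""
    m = PATTERN.search(event)
    if m is None:
        return None
    variables = {0: event}            # $0 is the entire event line
    for i, value in enumerate(m.groups(), start=1):
        variables[i] = value          # $1, $2, ... from the subexpressions
    return variables
```

Calling extract("HostOne: test number 34") yields a dictionary whose entries 1 and 2 hold the host name and the test number as strings.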


Because complex rule types create ongoing 
correlation operations, a single rule can spawn many 
active correlation operations. Using the regular 
expression above, we could have one correlation that 
counted the number of events for HostOne and another 
separate correlation that counted events for HostTwo. 
Both counting correlations would be formed from the 
same rule, but by extracting data from the event the 
two correlations become separate entities. 


This data allows the creation of unique contexts, 
correlation descriptions and coupled patterns linked to 
the originating event. We will explore these items in 
more detail later. Remember that when applying a 
rule, the regular expression or pattern is always 
applied first regardless of the ordering of the key- 
words. As a result, references to $1, $2, ..., $N any- 
where else in the rule refer to the data extracted by the 
regular expression. 


SEC provides a flow control and data storage 
mechanism called contexts. As a flow control mecha- 
nism, contexts allow rules to influence the application 
of other rules. Contexts have the following features: 

• Contexts are dynamically created and often 
named using data extracted from an event to 
make names unique. 

• Contexts have a defined lifetime that may be 
infinite. This lifetime can be increased or 
decreased as a result of rules or timeouts. 

• Multiple contexts can exist at any one time. 

• A context can execute actions when its lifetime 
expires. 

• Contexts can be deleted without executing any 
end-of-lifetime actions. 

• Rules (and the correlations they spawn) can use 
boolean expressions involving contexts to 
determine if they should apply. Existing contexts 
return a true value; non-existent contexts 
return a false value. If the boolean expression is 
true, the rule will execute; if false, the rule will 
not execute (be suppressed). 
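
To make the flow control role of contexts concrete, here is a small Python model of the features listed above. It is a sketch under simplifying assumptions, not SEC's implementation: the class ContextStore, the helper rule_applies, and the use of plain numbers of seconds for time are all hypothetical, and end-of-lifetime actions are not modeled.

```python
class ContextStore:
    """Toy model of SEC-like contexts; times are plain seconds."""

    def __init__(self):
        self._expiry = {}  # context name -> expiry time (None = infinite)

    def create(self, name, now, lifetime=None):
        self._expiry[name] = None if lifetime is None else now + lifetime

    def delete(self, name):
        # Deleting a context skips any end-of-lifetime action.
        self._expiry.pop(name, None)

    def exists(self, name, now):
        if name not in self._expiry:
            return False
        expiry = self._expiry[name]
        return expiry is None or now < expiry


def rule_applies(store, now, required=(), forbidden=()):
    """Evaluate expressions of the form ctx1 && ctx2 && !ctx3."""
    return (all(store.exists(n, now) for n in required)
            and not any(store.exists(n, now) for n in forbidden))
```

A rule guarded by an expression such as a && !b would be modeled as rule_applies(store, now, required=["a"], forbidden=["b"]).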


In addition to a flow control mechanism, con- 
texts also serve as storage areas for data. This data can 
be events, parts of events or arbitrary strings. All con- 
texts have an associated data store. In this paper, the 
word “context” is used for both the flow control 
entity and its associated data store. When a context is 
deleted, its associated data store is also deleted. Con- 
texts are most often used to gather related events, for 
example login and logout events for a user. These con- 
texts can be reported to the system administrator if 
certain conditions are detected (e.g., the user tried to 
perform a su during the login session). 


The above description might seem to imply that a 
single context has a single data store; this is not always 
the case. Multiple contexts can share the same data 
store using the alias mechanism. This allows events 
from different streams to be gathered together for 
reporting or further analysis. The ability to extract data 
from an event and link the context by name to that 
event provides a mechanism for combining multiple 
event streams into a single context that can be reported. 
For example, if I extract the process ID 345 from 
syslog events, I can create a context called process_345 
and add all of the syslog events with the same PID to 
that context. If I now link the context process_346 to the 
process_345 context, I can add all of the syslog events 
with the PID 346 to the same context (data store). So 
now the process_345/process_346 context contains all 
of the syslog events from both processes. 
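
The process_345/process_346 example can be mimicked in Python by letting two names refer to one list. This is a minimal sketch of the idea only; the class ContextTable and its method names are hypothetical and do not correspond to SEC's internals.

```python
class ContextTable:
    """Context names map to list objects; aliases share one list."""

    def __init__(self):
        self._stores = {}  # context name -> list of stored events

    def create(self, name):
        self._stores.setdefault(name, [])

    def alias(self, existing, new_name):
        # Both names now refer to the *same* list object.
        self._stores[new_name] = self._stores[existing]

    def add(self, name, event):
        self._stores[name].append(event)

    def report(self, name):
        return list(self._stores[name])
```

After aliasing, events added under either name show up when the context is reported under the other name.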


In the paper, I use the term ‘session.’ A session is 
simply a record of events of interest. In general these 
events will be stored in one or more contexts. If ssh 
errors are of interest, a session will record all the ssh 
events into a context (technically a context data store that 
may be known by multiple names/aliases) and report that 
context. If tracing the identities that a user assumes dur- 
ing a login is needed, a different series of data is 
recorded in a context (data store): the initial ssh connec- 
tion information is recorded, the login event, the su event 
as the user tries to go from one user ID to another. 


The rest of the elements of SEC rules will be 
presented as needed by the examples. 


Responding To Or Filtering Single Events 


The majority of items that we deal with in 
processing a log file are single items that we have to 
either discard or act upon. Discardable events are the 
typical noise where the problem is either not fixable, 
for example failing reverse DNS lookups on remote 
domains from TCP wrappers, or are valueless 
information that we wish to discard. 


Discardable events can be handled using the sup- 
press rule. Figure 2 is an example of such a rule. 


type=suppress 
desc=ignore non-specific paper problem \ 
report since prior events have \ 
given us all we need. 
ptype=regexp 
pattern=. printer: paper problem$ 
Figure 2: A suppress rule that is used to ignore a 
noise event sent during a printer error. Note: SEC 
example rules are reformatted/split for readabil- 
ity. They may or may not work exactly as pre- 
sented. 





Since all of my rule sets report anything that is 
not handled, we want to explicitly ignore all noise 
lines to prevent them from making it to the default 
“report everything” rule. 


This is a good time to look at the basic anatomy 
of a SEC rule. All rules start with a type option as 
described earlier. All rules have a desc option that 
documents the rule’s purpose. For the complex correlation 
rules, the description is used to differentiate between 
correlation operations derived from a single rule. We will 
see an example of this when we look at the horizontal 
port scan detection rules. 


Most rules have a pattern option that is applied to 
the event depending on the ptype option. The pattern 
can be a regular expression, a substring, or a truth 
value (TRUE or FALSE). The ptype option specifies 
how the pattern option is to be interpreted: a regular 
expression (regexp), a substring (substr), or a truth value 
(TValue). It also determines whether the pattern applies 
successfully when it matches the event (regexp/substr) or 
when it does not match the event (nregexp/nsubstr). For TValue 
type patterns, TRUE matches any event (successful 
application), while FALSE (not successfully applied) 
does not match any input event. If the pattern does not 
successfully apply, the rule is skipped and the next rule 
in the configuration file is applied. 


A number can be added to the end of any of the 
nregexp, regexp, substr, or nsubstr values to make the pat- 
tern match across that many lines. So a ptype value of reg- 
exp2 would apply the pattern across two lines of input. 
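
The effect of a numbered ptype such as regexp2 can be approximated in Python by keeping the last N lines in a buffer and matching the pattern against them joined by newlines. This is a sketch of the idea, not SEC's code: the function match_across_lines is hypothetical, and the sample cron log lines in the usage note are assumed for illustration.

```python
import re
from collections import deque


def match_across_lines(lines, pattern, n=2):
    """Yield (line_index, match) when pattern matches the last n lines."""
    regex = re.compile(pattern)
    window = deque(maxlen=n)          # holds only the most recent n lines
    for i, line in enumerate(lines):
        window.append(line)
        if len(window) == n:
            m = regex.search("\n".join(window))
            if m:
                yield i, m
```

For example, with the assumed input lines "> CMD: /usr/lib/sendmail -q" followed by "> root 25453 c Sun Jun 13 18:31:00 2004", the pattern r"^> CMD: /usr/lib/sendmail -q.*\n> root ([0-9]+) c .*" matches once the second line has arrived, and the first group holds the process ID.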


By default when an event triggers a rule, the 
event is not compared against other rules in the same 
file. This can be changed on a per rule basis by using 
the continue option.[1] 


After single event suppression, the next basic 
rule type is the single rule. This is used to take action 
when a particular event is received. Actionable events 
can interact with other higher level correlation events: 
adding the event to a storage area (context), changing 
existing contexts to activate or deactivate other rules, 
activating a command to deal with the event, or just 
reporting the event. Figure 3 is an example of a single 
rule that will generate a warning if the printer is 
offline from an unknown cause. 


In Figure 3 we see two more rule options: context 
and action. The context option is a boolean expression 
of contexts that further constrains the rule. 


When processing the event 

lj2.cs.umb.edu: printer: Report Printer Offline if needed 

the single rule in Figure 3 checks to see if the pattern 
applies successfully. In this case the pattern matches 
the event, but if the Report_Printer_lj2.cs.umb.edu_Offline 
context does not exist, then the actions will not be 
executed. The context Report_Printer_lj2.cs.umb.edu_Offline 
is deleted by other rules in the ruleset (not shown) if a 
more exact diagnosis of the cause is detected. This 
suppresses the default (and incorrect) report of the problem. 


The action option specifies the actions to take 
when the rule fires. In this case it writes the message 
“printer lj2.cs.umb.edu offline, unknown cause” to standard 
output (specified by the file name “-”). Then it 
deletes the context Report_Printer_lj2.cs.umb.edu_Offline 
since it is no longer needed. 


There are many potential actions, including: 

• creating, deleting, and performing other 
operations on contexts 
• invoking external programs 
• piping data or current contexts to external 
programs 
• resetting active correlations 
• evaluating Perl mini-programs 
• setting and using variables 
• creating new events 
• running child processes and using the output 
from the child as a new event stream. 

We will discuss and use many of these actions later in 
this paper. 


type=single 
continue=dontcont 
desc = Report Printer Offline if needed 
ptype=regexp 
pattern=^([\w._-]+): printer: Report Printer Offline if needed 
context = Report_Printer_$1_Offline 
action = write - "printer $1 offline, unknown cause" ; \ 
         delete Report_Printer_$1_Offline 

Figure 3: A single rule that writes a warning message and deletes a context that determines if it should execute. 


[1] Note: continue is not supported for the suppress rule type. 


Scheduling Events With Finer Granularity 


Part of modeling normal system activity includes 
accounting for scheduled activities that create events. 
For example, a scheduled weekly reboot is not worth 
reporting if the reboot occurs during the scheduled 
window; however, it is worth reporting if it occurs at 
any other time. 


For this we use the calendar rule. It allows the 
reader to schedule and execute actions on a cron-like 
schedule. In place of the ptype and pattern options it 
has a time option that has five cron-like fields. It is 
wonderful for executing actions or starting intervals on 
a minute boundary. Sometimes we need to start 
intervals with resolution of a second rather than a minute. 


Figure 4 shows a mechanism for generating a 
window that starts 15 seconds after the minute and lasts 
for 30 seconds. The key is to create two contexts and 
use both of them in the rules that should be active (or 
inactive) only during the given window. One context, 
wait_for_window, expires to begin the timed interval. The 
window context expires, marking the end of the interval. 
Creating an event on a non-minute boundary is trivial 
once the reader learns that the event command has a 
built-in delay mechanism. 


Triggering events generated by calendar rules or by 
expiring contexts can execute actions, define intervals, 
trigger rules or pass messages between rules. Triggering 
events are used extensively to detect missing events. 


type=calendar 
time=30 3 * * * 
desc=create 30 second window 
action=create window 45; \ 
       create wait_for_window 15 

type=single 
context=window && !wait_for_window 

Figure 4: A mechanism for creating a timed interval 
that starts on a non-minute boundary. 





Detecting Missing Events 


The ability to generate arbitrary events and win- 
dows with arbitrary start and stop times is useful when 
detecting missing events. The rules in Figure 5 report 
a problem if a ‘sendmail -q’ command is not run by 
root near 31 minutes after the hour. Because of natural 
variance in the schedule, I expect and accept a send- 
mail start event from five seconds before to 10 sec- 
onds after the 31st minute. 


## rule 1: detect the sendmail event 
type = single 
desc = sendmail has run, don’t report it as failed 
ptype = regexp2 
pattern = ^\> CMD: /usr/lib/sendmail -q.*\n\> root ([0-9]+) c .* 
context = sendmail_31_minute && ! sendmail_31_minute_inhibit 
action = delete sendmail_31_minute 

## rule 2: define the time window and prep to report a missing event 
type = calendar 
desc = Start searching for sendmail invocation at 31 past hour 
time = 30 * * * * 
action = create sendmail_31_minute 70 write - \ 
         Sendmail failed to run detected at %t; \ 
         create sendmail_31_minute_inhibit 55 

Figure 5: Rules to detect a missed execution of a sendmail process at the appointed time. 


The event stream from Figure 1 is used as input 
to the rules. Figure 6 displays the changes that occur 
while processing the events in time. Contexts are 
represented by rectangles; the length of the rectangle is 
the context’s lifetime. Upside down triangles represent 
the arrival of events. Regular triangles represent 
actions within SEC. The top graph shows the sequence 
when the sendmail event fails to arrive, while the 
bottom graph shows the sequence when the sendmail 
program is run. 


The correlation starts when rule 2 (the calendar 
rule) creates the context sendmail_31_minute that will 
execute an action (write a message to standard output) 
when it times out after 70 seconds (near 31 minutes 
and 10 seconds), ending the interval. The calendar rule 
creates a second context, sendmail_31_minute_inhibit, 
that will time out in 55 seconds (near 30 minutes and 55 
seconds), starting the 15 second interval for the arrival 
of the sendmail event. Looking at the top graph in 
Figure 6, we see the creation of the two contexts on the 
second and third lines. No event arrives within the 15 
second window, so the sendmail_31_minute context expires 
and executes the “write” action. The bottom graph shows 
what happens if the sendmail event is detected. Rule 1 
is triggered by the sendmail event occurring in the 15 
second window and deletes the sendmail_31_minute 
context. The deletion also prevents the “write” action 
associated with the context from being executed. 


[Figure 6 graphic not reproduced: two timelines on an axis 
of seconds from 0 to 80, starting 30 minutes after the hour 
when the calendar rule fires and ending 31 minutes and 10 
seconds after the hour. Top timeline: the contexts 
sendmail_31_minute and sendmail_31_minute_inhibit, with the 
message written when sendmail_31_minute expires. Bottom 
timeline: the same contexts, with sendmail_31_minute deleted 
when the sendmail CMD line is detected.] 

Figure 6: Two timelines showing the events and contexts involved in detecting a missing, or present, sendmail invocation from cron. 


Note that the boolean context of rule 1 prevents 
its execution if the sendmail event were to occur more 
than five seconds before the 31st minute, since 
! sendmail_31_minute_inhibit is false because 
sendmail_31_minute_inhibit exists and is therefore true. If 
the sendmail event occurs after 31 minutes and 10 seconds, 
the context expression is again false since 
sendmail_31_minute does not exist, and is false. 
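
The interplay of the two contexts can be checked with a short simulation of one firing of the calendar rule. The Python below is a sketch under assumptions: the function sendmail_check is hypothetical, times are seconds counted from the moment the calendar rule fires, and the behavior exactly on the 55 and 70 second boundaries is my own choice.

```python
def sendmail_check(event_offsets, report_lifetime=70, inhibit_lifetime=55):
    """Simulate one calendar firing.

    event_offsets lists the seconds (after the calendar rule fires) at
    which a sendmail start event is seen. Returns True when the
    "Sendmail failed to run" message would be written.
    """
    report_pending = True                       # sendmail_31_minute exists
    for t in sorted(event_offsets):
        inhibit_exists = t < inhibit_lifetime   # sendmail_31_minute_inhibit
        report_exists = report_pending and t < report_lifetime
        if report_exists and not inhibit_exists:
            report_pending = False              # rule 1 deletes the context
    return report_pending                       # expiry action still armed?
```

An event at 62 seconds cancels the report, while events at 40 seconds (too early) or 75 seconds (too late) leave it in place.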


The example rules use the write action to report a 
problem. In a real ruleset, the systems administrator 
could use the SEC shellcmd action to invoke logger(1) 
to generate a syslog event to be forwarded to a central 
syslog server. This event would be found by SEC run- 
ning on the syslog master. The rule matching the event 
could notify the administrator via email, pager, wall(1) 
or send a trap to an NMS like HPOV or Nagios. 
Besides reporting, the event could be further processed 
with a threshold rule that would try to restart cron as 
soon as two or more “missed sendmail event” events 
are reported, and report a problem only if a third 
consecutive “missed sendmail event” arrived. 


Repeat Elimination/Compression 


I have dealt with real-time log file reporters that 
generated 300 emails when a partition filled up 
overnight. There must be a method to condense or de- 
duplicate repeated events to provide a better picture of 
a problem, and reduce the number of messages spam- 
ming the administrators. 


The SingleWithSuppress rule fills this de-duplica- 
tion need. To handle file system full errors, the rule in 
Figure 7 is used. 


This rule reports that the filesystem is full when it 
receives its first event. It then suppresses the event 
message for the next hour. Note that the desc keyword 
includes the filesystem and hostname ($2 and $1 
respectively). This makes the correlation operation that 
is generated from the rule unique so that a disk full 
condition on the same host for the filesystem /mount/fs2 
will generate an error event if it occurs five minutes 
after the /mount/sd0f event. If the filesystem was not 
included in the desc option, then only one alert for a full 
filesystem would be generated regardless of how many 
filesystems actually filled up during the hour. 


Report on Analysis of Event Contents 


Unlike most other programs, SEC allows the 
reader to extract and analyze data contained within an 
event. One simple example is the rule that analyzes 
NTP time adjustments. I consider any clock with less 
than 1/4 of a second difference from the NTP controlled 
time sources to be within a normal range. Figure 8 
shows the rules that are applied to analyze the xntpd 
time adjustment events. We extract the value of the 
time change from the step messages. This value is 
assigned to the variable $2. The context expression 
executes a Perl mini-program to see if the absolute 
value of the change is larger than the threshold of 0.25 
seconds. If it is, the context is satisfied and the rule’s 
actions fire. 


The context expression uses a mechanism to run 
arbitrary Perl code. It then uses the result of the 
expression to determine if the rule should fire. It can 
be used to match networks after applying a netmask, 
perform calculations with fields of the event or other 
tasks to properly analyze the events. 


Detect Identical Events Occurring Across Multiple 
Hosts 


A single incident can affect multiple hosts. 
Detecting a series of identical events on multiple hosts 
provides a measure of the scope of the problem. The 
problem can be an NFS server failure affecting only 
one host that does not need to be paged out in the 
middle of the night, or it may affect 100 hosts, which 
requires recovery procedures to occur immediately. 


## Example: 
## Apr 13 15:08:52 host4.example.org ufs: [ID 845546 \ 
## kern.notice] NOTICE: alloc: /mount/sd0f: file system full 
type=SingleWithSuppress 
desc=Full filesystem $2 on $1 
ptype=regexp 
pattern=([\w._-]+) ufs: \[.* NOTICE: alloc: \ 
        ([\w/._-]+): file system full 
action= write - filesystem $2 on host $1 full 
window=3600 

Figure 7: A rule to report a file system full error and suppress further errors for 60 minutes. 
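
The way the desc value keys each suppression operation in Figure 7 can be imitated in Python. This is a rough sketch rather than SEC's behavior in every detail: the class SuppressingReporter is hypothetical, and time is passed in as plain seconds.

```python
import re

PATTERN = re.compile(
    r"([\w._-]+) ufs: \[.* NOTICE: alloc: ([\w/._-]+): file system full")


class SuppressingReporter:
    def __init__(self, window=3600):
        self.window = window
        self._active_until = {}  # description -> end of suppression

    def handle(self, now, line):
        """Return a report string, or None if unmatched or suppressed."""
        m = PATTERN.search(line)
        if m is None:
            return None
        host, filesystem = m.group(1), m.group(2)
        # The description keys the operation, as desc does in the rule.
        desc = f"Full filesystem {filesystem} on {host}"
        if now < self._active_until.get(desc, float("-inf")):
            return None
        self._active_until[desc] = now + self.window
        return f"filesystem {filesystem} on host {host} full"
```

Because the filesystem is part of the description, a second filesystem filling up on the same host five minutes later is still reported, while repeats for the first one stay quiet for the hour.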





type=single 
desc = report large xntpd corrections for host $1 
continue = dontcont 
ptype=regexp 
context= =(abs($2) > 0.25) 
pattern=([A-z0-9._-]+) xntpd\[[0-9]+\]:.*time reset \(step\) ([-]?[0-9.]+) s 
action= write - "large xntpd correction($2) on $1" 

Figure 8: Rule to analyze time corrections in NTP time adjustment events. The absolute value of the time 
adjustment must be greater than 0.25 seconds to generate a warning. 
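
The test performed by the context expression in Figure 8 can be reproduced in Python to experiment with the threshold. This is a sketch only: the function large_correction is hypothetical, and the sample log line in the usage note is assumed rather than taken from a real xntpd log.

```python
import re

PATTERN = re.compile(
    r"([A-z0-9._-]+) xntpd\[[0-9]+\]:.*time reset \(step\) ([-]?[0-9.]+) s")


def large_correction(line, threshold=0.25):
    """Return a warning string for large steps, otherwise None."""
    m = PATTERN.search(line)
    if m is None:
        return None
    host, raw_step = m.group(1), m.group(2)
    try:
        step = float(raw_step)
    except ValueError:               # e.g. "1.2.3" also fits [0-9.]+
        return None
    if abs(step) > threshold:
        return f"large xntpd correction({raw_step}) on {host}"
    return None
```

Given the assumed line "Jun 13 18:31:00 host1.example.org xntpd[123]: time reset (step) -0.431002 s", the function returns a warning, while a step of 0.102 seconds is treated as normal.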
Other problems such as time synchronization, or 
detection of port scans also fall into this realm. 


One typical example of this rule is to detect hori- 
zontal port scans. The rules in Figure 9 identify a hori- 
zontal port scan as three or more connection denied 
events from different server hosts within five minutes 
from a particular external host or network. So 20 con- 
nections to different ports on the same host would not 
result in the detection of a horizontal scan. In the 
example, I assume that the hosts are equipped with 
TCP wrappers that report denied connections. The set 
of rules in Figure 9 implements the detection of a hori- 
zontal port scan by counting unique client host/server 
host combinations. A timeline of these three rules is 
shown in Figure 10. 


## Example input: 
## May 10 13:52:13 cyber TCPD-Event cyber:127.6.7.1:3424:sshd deny \ 
##   badguy.example.com:192.268.15.45 user unknown 
## Variable = description (value from example above) 
## $3 = server ip address (127.6.7.1) 
## $5 = daemon or service connected to on server (sshd) 
## $8 = ip address of client (attacking) machine (192.268.15.45) 
## $9 = 1st quad of client host ip address (192) 
## $10 = 2nd quad of client host ip address (268) 
## $11 = 3rd quad of client host ip address (15) 
## $12 = 4th quad of client host ip address (45) 

## Rule 1: Perform the counting of unique destinations by client host/net 
type = SingleWithThreshold 
desc = Count denied events from $8 
continue = takenext 
ptype = regexp 
pattern = ^(.*) TCPD-Event ([A-z0-9_.]*):([0-9.]*):([0-9]*):([^ ]*) (deny) \ 
          ([^:]*):(([0-9]*)\.([0-9]*)\.([0-9]*)\.([0-9]*)) user (.*) 
action = report conn_deny_from_$8 /bin/cat >> report_log 
context = ! seen_connection_from_$8_to_$3 
thresh = 3 
window = 300 

## Rule 2: Insert a rule to capture synthesized network tcpd events. 
type=single 
pattern = ^(.*) TCPD-Event ([A-z0-9_.]*):([0-9.]*):([0-9]*):([^ ]*) (deny) \ 
          ([^:]*):(([0-9]*)\.([0-9]*)\.([0-9]*)\.([0-9]*)) user (.*) net$ 
action=none 

## Rule 3: Generate network counting rules and maintain contexts 
type = single 
desc = maintain counting contexts for deny service $5 from $8 event 
continue = takenext 
ptype = regexp 
pattern = ^(.*) TCPD-Event ([A-z0-9_.]*):([0-9.]*):([0-9]*):([^ ]*) (deny) \ 
          ([^:]*):(([0-9]*)\.([0-9]*)\.([0-9]*)\.([0-9]*)) user (.*) 
context = ! seen_connection_from_$8_to_$3 
action = create seen_connection_from_$8_to_$3 300; \ 
         add conn_deny_from_$8 $0 ; \ 
         event 0 $1 TCPD-Event $2:$3:$4:$5 $6 $7:$9.$10.$11.0 user $13 net; \ 
         event 0 $1 TCPD-Event $2:$3:$4:$5 $6 $7:$9.$10.0.0 user $13 net; \ 
         event 0 $1 TCPD-Event $2:$3:$4:$5 $6 $7:$9.0.0.0 user $13 net; \ 
         add conn_deny_from_$9.$10.$11.0 $0 ; \ 
         add conn_deny_from_$9.$10.0.0 $0 ; \ 
         add conn_deny_from_$9.0.0.0 $0 

Figure 9: Rules to detect horizontal port scans defined by connections to 3 different server hosts from the same 
client host within 5 minutes. Note: patterns are split for readability. This is not valid for sec input. 


The key to understanding these rules is to realize 
that the description field is used to match events with 
correlation operations. When rule 1, the threshold 
correlation rule, sees the first rejected connection from 
192.168.1.1 to 10.1.2.3, it generates a “Count denied 
events from 192.168.1.1” correlation. The next time a 
deny for 192.168.1.1 arrives, it will be tested by rule 
1; the description field generated from this new event 
will match an ongoing threshold correlation operation 
and it will be considered part of the “Count denied events 
from 192.168.1.1” threshold correlation. If a rejection 
event for the source 193.1.1.1 arrives, the generated 
description field will not match an active threshold 
correlation, so a new correlation operation will be 
started with the description “Count denied events from 
193.1.1.1”. 


Figure 10 shows a correlation operation from 
start to finish. First the event E1 reports a denial from 
host 192.168.1.1 to connect/scan 10.1.2.3. The 
correlation operation “Count denied events from 192.168.1.1” 
is started by rule 1, rule 2 is skipped because the pattern 
does not match, and rule 3 creates the 5-minute-long 
context seen_connection_from_192.168.1.1_to_10.1.2.3 
that is used to filter arriving events to make sure that 
only unique events are counted. The rest of rule 3’s 
actions will be discussed later. 


The count for rule 1, the threshold correlation 
operation, is incremented only if the 
seen_connection_from_192.168.1.1_to_10.1.2.3 context does not 
exist. When E1/2 (event 1, number 2) arrives, this 
context still exists and all the rules ignore the event. When 
E2/1 arrives, it triggers rule 1 and rule 3, creating the 
appropriate context and incrementing the threshold 
operation’s count. 


When five minutes have passed since E1/1’s 
arrival and the threshold rule has not been triggered by 
the arrival of three events, the start of the threshold 
rule is moved to the second event that it counted, and 
the count is decremented by 1. This occurs because 
the threshold rule uses a sliding window by default. 
When events E3/1 and E4/1 arrive, they are counted by 
the shifted threshold correlation operation started by 
rule 1. With the arrival of E2/1, E3/1, and E4/1, three 
events have occurred within five minutes and a 
horizontal port scan is detected. As a result, the action 
reporting the context conn_deny_from_192.168.1.1 is 
executed and the events counted during the correlation 
operation (maintained by the add action of rule 3) are 
reported to the file report_log. 
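
The combination of a filtering context and a sliding-window threshold can be modeled compactly in Python. This is a sketch under assumptions, not a reimplementation of SEC: the class HorizontalScanDetector is hypothetical, times are plain seconds, and once the threshold is reached the sketch simply resets its count, which simplifies what SEC does for the rest of the window.

```python
from collections import defaultdict, deque


class HorizontalScanDetector:
    def __init__(self, thresh=3, window=300):
        self.thresh = thresh
        self.window = window
        self._seen_until = {}               # (client, server) -> filter expiry
        self._counted = defaultdict(deque)  # client -> times of counted events

    def deny(self, now, client, server):
        """Process one denied connection; True means a scan was detected."""
        key = (client, server)
        if now < self._seen_until.get(key, float("-inf")):
            return False                    # pair already seen: not unique
        self._seen_until[key] = now + self.window
        times = self._counted[client]
        times.append(now)
        while times and now - times[0] >= self.window:
            times.popleft()                 # slide the window forward
        if len(times) >= self.thresh:
            times.clear()                   # simplification: start over
            return True
        return False
```

Feeding it the sequence from the walkthrough (E1/1, a repeat E1/2, E2/1, then E3/1 and E4/1 after the first window has passed) produces a detection on E4/1, while twenty denials against a single server never do.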


Rule 2 and the final actions of rule 3 allow
detection of horizontal port scans even if they come from
different hosts such as 192.168.3.1, 192.168.1.1, and
192.168.7.1. If each of these hosts scans a different
host on the 10 network, it will be detected as a
horizontal scan from the 192.168.0.0 network. This is
done by creating three events replacing the real source
address with a corresponding network address. One
event is created for each class A, B and C network that
the original host could belong to: 192.168.1.0,
192.168.0.0, and 192.0.0.0. The responses to these
synthesized events are not shown in Figure 10, but they
start a parallel series of correlation operations and
contexts using the network address of the client in
place of 192.168.1.1.
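
The address substitution described above is easy to state in code. The following Python is a sketch of the idea only; the function name is hypothetical and SEC performs this step with rule actions rather than a helper like this.

```python
def classful_networks(ip):
    """Return the class C, B, and A style network addresses for a dotted quad."""
    octets = ip.split(".")
    if len(octets) != 4:
        raise ValueError("expected a dotted-quad IPv4 address: %r" % ip)
    return [
        ".".join(octets[:3] + ["0"]),            # e.g. 192.168.1.0
        ".".join(octets[:2] + ["0", "0"]),       # e.g. 192.168.0.0
        ".".join(octets[:1] + ["0", "0", "0"]),  # e.g. 192.0.0.0
    ]

hosts = ["192.168.3.1", "192.168.1.1", "192.168.7.1"]
# Networks that every scanning host maps to; these are the buckets in which
# the three scans are counted together.
shared = set.intersection(*(set(classful_networks(h)) for h in hosts))
print(classful_networks("192.168.1.1"))
print(sorted(shared))
```

The three example hosts have no class C style network in common, but they share 192.168.0.0, which is why the scan is attributed to that network.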


Vertical scans can use the same framework with 
the following changes: 

• the filtering context needs to include port numbers
  so that only unique client host/server host/port
  triples are counted by the threshold rule.

• the description of rule 1 needs to include the server
  host IP so that it only counts connections to a
  specific server host.


This will count the number of unique server ports 
that are accessed on the server from the client host. 


In general, using rules 1 and 3, you can count
unique occurrences of a value or group of values. The
context used to link the rules must include the unique
values in its name. The description used in rule 1 will
not include the unique values and will create a bucket
in which the events will be counted. In the horizontal
port scan case, my bucket was any connection
from the same client host. The unique value was the
server IP address connected to by the client host.
In detecting a vertical port scan, the unique value is
the port connected to, while the bucket is
the client/server host pair.
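
The bucket/unique-value split can be made concrete with a small sketch. The Python below deliberately ignores time windows and only shows how the choice of fields changes what is counted; the function name, the field names, and the sample events are assumptions for illustration.

```python
from collections import defaultdict

def count_unique(events, bucket_fields, unique_fields):
    """Count distinct unique_fields values seen within each bucket."""
    buckets = defaultdict(set)
    for ev in events:
        bucket = tuple(ev[f] for f in bucket_fields)   # plays the role of rule 1's description
        unique = tuple(ev[f] for f in unique_fields)   # extra part of the filter context name
        buckets[bucket].add(unique)
    return {bucket: len(values) for bucket, values in buckets.items()}

events = [
    {"client": "192.168.1.1", "server": "10.1.2.3", "port": 22},
    {"client": "192.168.1.1", "server": "10.1.2.3", "port": 23},
    {"client": "192.168.1.1", "server": "10.1.2.4", "port": 22},
    {"client": "192.168.1.1", "server": "10.1.2.3", "port": 22},  # repeat, not recounted
]

# Horizontal scan: bucket is the client, unique value is the server.
horizontal = count_unique(events, ("client",), ("server",))
# Vertical scan: bucket is the client/server pair, unique value is the port.
vertical = count_unique(events, ("client", "server"), ("port",))
print(horizontal)
print(vertical)
```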


These two changes allow the counting ruleset to
count the number of unique occurrences of the parameter
that is present in the filtering rule, but missing from
the rule 1 description (the bucket), e.g., if the context


[Figure 10 graphic not reproduced. Labels: time axis from
0 min to 6 min; events E1/1, E1/2, E2/1, E3/1, and E4/1;
"Generate network events" at each counted event; contexts
seen_connection_from_192.168.1.1_to_10.1.2.3 through
seen_connection_from_192.168.1.1_to_10.1.2.6; correlation
"Count denied events from 192.168.1.1"; correlation window
shifted to next event; 5 minute window for correlation
shifts to encompass three events in 5 minutes; "Report
excessive denied events for 192.168.1.1".]

Event 1  TCPD-Event 192.168.1.1 to 10.1.2.3
Event 2  TCPD-Event 192.168.1.1 to 10.1.2.4
Event 3  TCPD-Event 192.168.1.1 to 10.1.2.5
Event 4  TCPD-Event 192.168.1.1 to 10.1.2.6

Note: the start of the event correlation shifts from E1/1 to E2/1 (which is the second event counted) to detect 3 events/5 min.

Event 1, number 2 is not counted because of the existence of the seen_connection_from_192.168.1.1_to_10.1.2.3 context.


Figure 10: Timeline showing the application of rules to detect horizontal port scans.




2004 LISA XVIII — November 14-19, 2004 — Atlanta, GA 


Rouillard 


specifies serverhost, clienthost, and serverport and rule 1
specifies clienthost and serverhost in its description,
then the rules above implement counting of unique
ports for a given clienthost and serverhost. In the rules
as presented above, the context specified clienthost and
serverhost and rule 1 specified the clienthost, so the
ruleset counted unique serverhosts for a given clienthost.


Other counting methods can also be imple- 
mented using mixtures of the vertical and horizontal 
counting methods. 


While I implemented a “pure” SEC solution, the
ability to use Perl functions and data structures from
SEC rules provides other solutions [Vaarandi7_2003]
to this problem.


Creating Threads of Events from Multiple Sources 


Many thread recognition operations involve using 
one of the pair type rules. Pair rules allow identifica- 
tion of a future (child) event by searching for identify- 
ing information taken from the present (parent) event. 
This provides the ability to stitch a thread through vari- 
ous events by providing a series of pair rules. 


There are three times when you need to trigger 
an action with pair rules: 
1. Take action upon receipt of the parent event 
2. Take action upon receipt of the child event 
3. Take action after some time when the child 
event has not been received (expiration of the 
pair rule). 
The Pair rule provides actions for triggers 1 and 2.
The PairWithWindow rule provides actions for triggers 2
and 3. None of the currently existing pair rules provides
a mechanism for taking actions on all three triggers.
Figure 11 shows a way to make up for this limitation
by using a context that expires when the pair rule is due
to be deleted. Since trigger 2 and trigger 3 are mutually
exclusive, part of trigger 2’s action is to delete the
context that implements the action for trigger 3.


I have used this method for triggering an auto- 
matic repair action upon receipt of the first event. The 
arrival of the second event indicated that the repair 
worked. If the second event failed to arrive, an alert 
would be sent when the context timed out. Also, I 
have triggered additional data gathering scripts from 
the first event. The second event in this case reported 
the event and additional data when the end of the
additional data was seen. If the additional data did not 
arrive on time, I wanted the event to be reported. 
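
The three trigger points can be modeled as a small state machine. The Python below is a sketch with an explicit clock, not SEC code: the class and method names are hypothetical, and the log strings are borrowed from the write actions in Figure 11.

```python
class PairWithExpiry:
    """Act on the parent event, then on either the child event or the timeout."""

    def __init__(self, window=60):
        self.window = window
        self.deadline = None  # expiry time of the pending context, if any
        self.log = []

    def parent(self, now):
        self.log.append("rule triggered")      # trigger 1
        self.deadline = now + self.window      # create the expiring context

    def child(self, now):
        self.tick(now)                         # expire first if already overdue
        if self.deadline is not None:
            self.log.append("pattern 2 seen")  # trigger 2
            self.deadline = None               # delete the context: no expiry action

    def tick(self, now):
        """Advance the clock; run the expiry action if the context timed out."""
        if self.deadline is not None and now >= self.deadline:
            self.log.append("rule expired")    # trigger 3
            self.deadline = None

repaired = PairWithExpiry()
repaired.parent(0)
repaired.child(30)    # second event arrives in time
repaired.tick(100)    # nothing left to expire

failed = PairWithExpiry()
failed.parent(0)
failed.tick(61)       # window passes without the second event
failed.child(70)      # too late: the pair is no longer active
print(repaired.log, failed.log)
```

Because the child action deletes the pending context, triggers 2 and 3 can never both fire for the same parent event.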


Real-time Log File Analysis Using the Simple Event Correlator (SEC) 


This mechanism can replace combinations of 
PairWithWindow and Single rules. It simplifies the rules 
by eliminating duplicate information, such as patterns, 
that must be kept up to date in both rules. 


Correlating Across Processes 


One of the more difficult correlation tasks involves
creating a session made up of events from multiple
processes.


Figure 12 shows a ruleset that sets up a link
between parent and child ssh processes. Its application
is shown in Figure 13.


When a connection to ssh occurs, the parent 
process, running as root, reports the authentication 
events and generates information about a user’s login. 
After the authentication process, a child sshd is 
spawned that is responsible for other operations includ- 
ing port forwarding and logout (disconnect) events. 
The ruleset in Figure 12 captures all of the events gen- 
erated by the parent or child ssh process. This includes 
errors generated by the parent and child ssh processes. 


A session starts with the initial network connec- 
tion to the parent sshd and ends with a connection 
closed event from the child sshd. I accumulate all 
events from both processes into a single context. I also 
have rules (not shown in the example) to report the 
entire context when unexpected events occur. 


The tricky part is accumulating the events from 
both processes into a single context. The connection 
between the event streams is provided by a tie event 
that encompasses unique identifying elements from 
both event streams and thus ties together the two 
streams into a single stream. 


Each ssh process has its own unique event
stream stored in the context session_log_<hostname>_<pid>.
There is a Single rule, omitted for
brevity, that accumulates ssh events into this context.
When the tie event is seen, it provides the link
between the parent sshd session_log context and the
child session_log context. The data from the two
contexts is merged and the two context names (with the
parent and child PIDs) are assigned to the same
underlying context. Hence the child’s
session_log_<hostname>_<child pid> context and the parent’s
session_log_<hostname>_<parent pid> context refer to the same
data. After the contexts are linked, actions using either
the child context name or the parent context name





type=pair
action = write - rule triggered ; \
         create take_action_on_pair_expiration 60 (write - rule expired)
pattern2=
action2 = write - pattern 2 seen ; \
          delete take_action_on_pair_expiration
window=60


Figure 11: A method to take an action on all three trigger points in a pair rule. 
operate on the same underlying context. Reporting or
adding to the context using one of the linked names 
acts the same regardless of which name is used. 


In Figure 13 the first event E1 triggers rule 1
from Figure 12, the PairWithWindow rule, to recognize
the start of the session. The second half of rule 1 looks
for a tie event for the following 60 seconds. There
may be many tie events, but there should be only one
tie event that contains the PID of the parent sshd.
Since we have that stored in $2, we use it in pattern2.
The start of session event is passed onto additional
rules (not shown) by setting the continue option on
rule 1 to takenext. These additional rules record the
events in the session_log context identified by system
and PID, as in the session_log_example.org_10240
context of Figure 13.


If the tie event is not found within 60 seconds,
the session_log_example.org_10240 context is reported.
However, if the tie event is found as in Figure 13, then
a number of other operations occur. The tie event is
generated by a script that is run by the child sshd.
Therefore it is possible for the child sshd to generate
events before the tie event is created. Because the
default rule adds these events to the
session_log_example.org_10245 context, additional work must
be done when the tie event arrives to preserve the data in
the child’s session_log. The second part of rule 1 in
Figure 12 copies the child’s session_log context into the
variable %b. The child’s session_log is then deleted and
aliased to the parent session_log. The %2 variable is the
value of $2 from the first pattern, the parent process’s PID.
After pattern2 is applied, the parent PID is referenced
as %2 because $2 is now the second subexpression of
pattern2. Next the data copied from the child log is
injected into the event stream to allow re-analysis and
reporting using the combined parent and child context.


The last action for the tie event is to alias the 
login username stored in the context session_log_owner_ 
<hostname>_<parent pid> to a similar context under the 
child pid. Then any rule that analyzes a child event 
can obtain the login name by referencing the alias con- 
text. Rule 2 in Figure 12 handles the login event and 
creates the context session_log_owner_<hostname>_ 
<parent pid> where it stores the login name for use by 
the other rules in the ruleset. Rule 2 also stores the 
login event in the session_log context. 


The last rule is very simple. It detects the “close
connection” (logout) event and deletes the contexts
created during the session. The delivery of event N 
(EN) in Figure 13 causes deletion of contexts. Deleting 
an aliased context deletes the context data store as 
well as all the names pointing to the context data store. 


## rule 1 - recognize the start of an ssh session,
##          and link parent and child event contexts.
type=PairWithWindow
continue=takenext
desc=Recognize ssh session start for $1[$2]
ptype=regexp
pattern=([A-Za-z0-9._-]+) sshd\[([0-9]+)\]: \[[^]]+\] Connection from ([0-9.]+) \
        port [0-9]+
action=report session_log_$1_$2 /bin/cat
desc2=Link parent and child contexts
ptype2=regexp
pattern2=([A-Za-z0-9._-]+) [A-z0-9]+\[[0-9]+\]: \[[^]]+\] SSHD child process +([0-9]+) \
         spawned by $2
action2=copy session_log_$1_$2 %b; \
        delete session_log_$1_$2; \
        alias session_log_$1_%2 session_log_$1_$2; \
        add session_log_$1_$2 $0; \
        event 0 %b; \
        alias session_log_owner_$1_%2 session_log_owner_$1_$2
window=60

## rule 2 - recognize login event and save username for later use
type=single
desc=Start login timer
ptype=regexp
pattern=([A-Za-z0-9._-]+) sshd\[([0-9]+)\]: \[[^]]+\] Accepted \
        (publickey|password) for ([A-z0-9_-]+) from [0-9.]+ port [0-9]+ (.*)
action=add session_log_$1_$2 $0; add session_log_owner_$1_$2 $4

## rule 3 - handle logout
type=single
desc=Recognize ssh session end
ptype=regexp
pattern=([A-Za-z0-9._-]+) sshd\[([0-9]+)\]: \[[^]]+\] Closing connection to ([0-9.]+)
action= delete session_log_$1_$2; delete session_log_owner_$1_$2


Figure 12: Accumulating output from ssh into a single context. 
Rule 3 uses the child PID to delete
session_log_example.org_10245 and
session_log_owner_example.org_10245,
which cleans up all four context names (two from the
parent PID and two from the child) and both context
data stores.
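
The aliasing behavior that the ssh ruleset relies on can be modeled in a few lines. The Python below is a sketch of the semantics described in the text (several names, one data store, deletion removes every name); the class is hypothetical and is not how SEC stores contexts internally. It also simplifies the tie-event handling by adding the saved child data directly, whereas rule 1 re-injects it as events.

```python
class ContextStore:
    """Named contexts where several names may share one underlying data store."""

    def __init__(self):
        self._names = {}  # context name -> list object holding the event lines

    def add(self, name, line):
        self._names.setdefault(name, []).append(line)

    def alias(self, existing, new_name):
        self._names[new_name] = self._names[existing]  # same list, second name

    def report(self, name):
        return list(self._names[name])

    def delete(self, name):
        store = self._names[name]
        # Deleting an aliased context removes every name that points to it.
        for other in [n for n, s in self._names.items() if s is store]:
            del self._names[other]

    def names(self):
        return sorted(self._names)

parent = "session_log_example.org_10240"
child = "session_log_example.org_10245"

store = ContextStore()
store.add(parent, "E1 connection")
store.add(child, "E4 early child event")  # arrived before the tie event

# Tie event: save the child data, drop the child context, alias, then restore.
saved = store.report(child)
store.delete(child)
store.alias(parent, child)
for line in saved:
    store.add(child, line)

combined = store.report(parent)
names_before_logout = store.names()
store.delete(child)                        # logout seen under the child PID
names_after_logout = store.names()
print(combined, names_before_logout, names_after_logout)
```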


This mechanism can be used for correlating any
series of events and passing information between the
rules that comprise an analysis mechanism. The trick is
to find suitable tie events to allow the thread to be
followed. The tie event must contain unique elements
found in the event streams that are to be tied together. In
the ssh correlation I create a tie event using the PIDs of
the parent and child processes. Every child event includes
the PID of the child sshd so that I can easily construct
the context name that points to the combined context
data store. For the ssh correlation, I create the tie event
by running shell commands using the sshrc mechanism
and use the logger(1) command to inject the tie event
into the data stream. This creates the possibility that the
tie event arrives after events from the child process. It
would make the correlation easier if I modified the sshd
code to provide this tie event since this would generate
the events in the correct order for correlation.


Having the events arriving in the wrong order for 
cross correlation is a problem that is not easily reme- 
died. I suppress reporting of the child events while 
waiting for the tie event (not shown). Then once the 
tie event is received, the child events are resubmitted 
for correlation. This is troublesome and error prone 
and is an area that warrants further investigation. 
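
The suppress-then-resubmit approach can be sketched as follows. This Python is an assumption-laden illustration, not the ruleset that is "not shown": the class and method names are hypothetical, and it omits the timeout that would be needed if the tie event never arrives.

```python
class TieBuffer:
    """Hold child events until the tie event names their parent session."""

    def __init__(self):
        self.parent_of = {}  # child pid -> parent pid, known after the tie event
        self.pending = {}    # child pid -> events seen before the tie event
        self.sessions = {}   # parent pid -> events in the order they were correlated

    def parent_event(self, pid, text):
        self.sessions.setdefault(pid, []).append(text)

    def child_event(self, pid, text):
        parent = self.parent_of.get(pid)
        if parent is None:
            self.pending.setdefault(pid, []).append(text)  # suppress for now
        else:
            self.sessions.setdefault(parent, []).append(text)

    def tie_event(self, child_pid, parent_pid):
        self.parent_of[child_pid] = parent_pid
        for text in self.pending.pop(child_pid, []):
            self.child_event(child_pid, text)              # resubmit in arrival order

buf = TieBuffer()
buf.parent_event(10240, "Connection from 192.168.0.1 port 3500")
buf.child_event(10245, "error: connect_to 127.0.0.1 port 2401 failed")  # before the tie
buf.tie_event(10245, 10240)
buf.child_event(10245, "Closing connection to 192.168.0.1")
print(buf.sessions[10240])
```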


Strategies to Improve Performance 


One major issue with real-time analysis and noti- 
fication is the load imposed on the system by the anal- 
ysis tool and the rate of event processing. The rules 
can be restructured to reduce the computational load. 














In other cases the rule analysis load can be distributed 
across multiple systems or across multiple processes 
to reduce the load on the system or improve event 
throughput for particular event streams. 


The example rule set from UMB utilizes a num- 
ber of performance enhancing techniques. Originally 
these techniques were implemented in a locally modi- 
fied version of SEC. As of SEC version 2.2.4, the last 
of the performance improvements has been imple- 
mented in the core code. 


Rule Construction 


For SEC, construction of the rules file(s) plays a 
large role in improving performance. In SEC, the 
majority of computation time is occupied with recog- 
nizing events using Perl regular expressions. Optimiz- 
ing these regular expressions to reduce the amount of 
time needed to apply them improves performance. 


In addition, understanding that SEC applies each
rule sequentially allows the reader to put the most
often matched rules first in the sequence. Putting the
most frequently used rules first reduces the search
time needed to find an applicable rule. Sending a
USR1 signal to SEC causes it to dump its internal
state showing all active contexts, current buffers, and
other information including the number of times each
rule has been matched. This information is very useful
in efficiently restructuring a ruleset.
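
The benefit of ordering by match frequency can be estimated from per-rule match counts. The Python below is a sketch: the rule names and counts are invented for illustration, the dump file is not parsed, and reordering is only safe for rules whose behavior does not depend on their position (for example through continue=takenext or shared contexts).

```python
def reorder_by_hits(rules, hits):
    """Stable sort that puts the most frequently matched rules first."""
    return sorted(rules, key=lambda name: -hits.get(name, 0))

def expected_tests(order, hits):
    """Mean number of rules tried per matched event for a given rule order."""
    total = sum(hits.values())
    return sum((i + 1) * hits.get(name, 0) for i, name in enumerate(order)) / total

rules = ["ntp", "sshd", "cron"]              # hypothetical rule names
hits = {"ntp": 10, "sshd": 80, "cron": 10}   # hypothetical match counts

before = expected_tests(rules, hits)
reordered = reorder_by_hits(rules, hits)
after = expected_tests(reordered, hits)
print(reordered, before, after)
```

With these made-up counts the mean number of rules tried per event drops from 2.0 to 1.3.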


Using rule segmentation to reduce the number of
rules that must be scanned before a match is found
provides the biggest gains for the least amount of work.

Rule Segmentation 
In August 2003, I developed a method of using
SEC’s multiple configuration file mechanism to prune
the number of rules that SEC would have to test
before finding a matching rule.








[Figure 13 graphic not reproduced. Labels: the tie event
causes two contexts to be aliased together into one context
with two names (Notes 1, 2); events E5 through EN are
recorded in the context on a time axis running to N min;
the logout event destroys the context.]

E1 - Initial parent event: example.org sshd[10240]: [ID 800047 auth.info] Connection from 192.168.0.1 port 3500
E2 - Parent event: example.org sshd[10240]: [ID 800047 auth.info] Accepted publickey for rouilj from 192.168.0.1 port 3500 ssh2
E3 - Tie event: example.org rouilj[10248]: [ID 702911 auth.notice] SSHD child process 10245 spawned by 10240
E4 - Child event: example.org sshd[10245]: [ID 800047 auth.info] error: connect_to 127.0.0.1 port 2401 failed
E5 - Child event: example.org sshd[10245]: [ID 800047 auth.info] bind: Cannot assign requested address
EN - Logout child event: example.org sshd[10245]: [ID 800047 auth.info] Closing connection to 192.168.0.1

Note 1: creation of the session_log_example.org_10240 context in response to E1 is done by a catchall rule that
is not shown in the ruleset.
Note 2: the alias of two contexts is shown by the large box labeled with both context names.


Figure 13: The application of the ssh ruleset showing the key events in establishing the link between parent and
child processes.
This mechanism provides a limited branching
facility within SEC’s ruleset. A single-criterion filtering
rule is shown in Figure 14.





type=suppress
continue=dontcont
ptype=NRegExp
pattern=^[ABCD]
desc=guard for abcd rules

type=single
continue=dontcont
ptype=TValue
pattern=true
desc=guard for events handled by other \
     ruleset files
action=logonly
context = [handled]

type=single
continue=takenext
ptype=TValue
pattern=true
desc=report handled
action=create handled

<rules here>

type=single
ptype=TValue
pattern=true
desc=Guess we didn’t handle this event \
     after all
action=delete handled


Figure 14: A sample rule set to allow events to be fil- 
tered and prevented from matching other rules in 
the file. 





This rule depends on the NRegExp pattern type,
which causes the rule to match if the pattern does not
match. The pattern is crafted to filter out events that
cannot possibly be acted upon by the other rules in
the file. In this example I show another guard that is
used to prevent this ruleset from considering the event
if it has already been handled. It consists of a rule that
matches all events² and fires if the handled context is
set. If these guards do not eliminate the event from
consideration, I set the handled context to prevent other
rulesets from processing the event and pass the event to
the ruleset. If the final rule triggers, then the event was
not handled by any rule in the ruleset. The final rule
deletes the handled context so that the following rulesets
will have a chance to analyze the event.


Note that this last rule is repeated in the final rule 
file to be applied. There it resets the handled context so 
that the next event will be properly processed by the 
rulesets. 
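
The effect of the guard rules can be illustrated by counting pattern tests. The Python below is a sketch of the idea rather than SEC's rule engine: the class, the guard patterns, and the rule names are hypothetical, and the early return stands in for the handled context.

```python
import re

class RuleFile:
    """A ruleset file: one guard pattern plus an ordered list of rules."""

    def __init__(self, guard, rules):
        self.guard = re.compile(guard)
        self.rules = [(re.compile(pattern), name) for pattern, name in rules]

def dispatch(files, line):
    """Return (matched rule name or None, number of pattern tests performed)."""
    tests = 0
    for rule_file in files:
        tests += 1
        if not rule_file.guard.search(line):
            continue                  # guard rejects: skip every rule in this file
        for regex, name in rule_file.rules:
            tests += 1
            if regex.search(line):
                return name, tests    # handled: later files never see the event
    return None, tests

files = [
    RuleFile(r"^[ABCD]", [(r"^Alpha", "alpha"), (r"^Delta", "delta")]),
    RuleFile(r"^[EFGH]", [(r"^Echo", "echo"), (r"^Golf", "golf")]),
]
print(dispatch(files, "Golf club"))   # one guard skipped, second file searched
print(dispatch(files, "Zulu time"))   # only the two guards are tested
```

An event that no file can handle costs one test per file instead of one test per rule.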


In addition to a single regexp, multiple patterns
can be applied, and if any of them selects the event, the
event will be passed through the rest of the rules in the
file. A rule chain to accept an event based on multiple
patterns is shown in Figure 15. The multiple filter
criteria can be set up to accept/reject the event using
complex boolean expressions so that the event must
match some patterns, but not other patterns.


²The TValue ptype is only available in SEC 2.2.5 and
newer. Before that, use regexp with a pattern of /^.?/.


type= single
desc= Accept event if match2 is seen.
continue= takenext
ptype= regexp
pattern= match2
action= create accept_rule

type= single
desc= Accept event if match3 is seen.
continue= takenext
ptype= regexp
pattern= match3
action= create accept_rule

type= single
desc= Skipping ruleset because neither \
      match2 or match3 were seen.
ptype= TValue
pattern= true
context= ! accept_rule
action= logonly

type= single
desc= Cleaning up accept_rule context \
      since it has served its purpose.
continue=takenext
ptype= TValue
pattern= true
context= accept_rule
action= delete accept_rule; logonly


<other rules here> 


Figure 15: A ruleset to filter the input event against
multiple criteria. The words “match2” or
“match3” must be seen in the input event to be
processed by the other rules.
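
The accept/reject logic of such a filter chain amounts to a boolean expression over pattern matches. The Python below is a minimal sketch of one such expression (match any of the required patterns and none of the rejected ones); the function name and the "debug" reject pattern are assumptions added for illustration.

```python
import re

def accept(line, require_any=(), reject_any=()):
    """Accept a line that matches a required pattern and no rejected pattern."""
    if any(re.search(pattern, line) for pattern in reject_any):
        return False
    # With no required patterns, everything that was not rejected is accepted.
    return not require_any or any(re.search(p, line) for p in require_any)

wanted = ("match2", "match3")
print(accept("an event with match2 in it", require_any=wanted))
print(accept("an unrelated event", require_any=wanted))
print(accept("match3 debug output", require_any=wanted, reject_any=("debug",)))
```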


The segmentation method can be arbitrary; however,
it is most beneficial to group rules by some
common thread such as the generator, e.g., using one
file/ruleset for analyzing sshd events and another one
for xntp events. Another segmentation may be by host
type, so that hosts with similar hardware are analyzed by
the same rules. Hostname is another good segmentation
property for rules that are applicable to only one host.


The segmentation can be made more efficient by 
grouping the input using SEC’s ability to monitor mul- 
tiple files. When SEC monitors multiple files, each 
file can have a context associated with it. While pro- 
cessing a line from the file, the context is set. For 
example, reading a line from /var/adm/messages may 
set the adm_messages context, while reading a line 
from /var/log/syslog would set the log_syslog context 
and clear the adm_messages context. This allows seg- 
mentation of rules by source file. Offloading the work 
of grouping to an external application such as syslog-ng 
provides the ability to group the events not only by 
facility and level as in classic syslog, but also by other 
parameters including host name, program, or by a
matching regular expression. Since syslog-ng operates 
on the components of a syslog message rather than the 
entire message, it is expected to be more efficient in 
segmenting the events than SEC. 


Restructuring the rules for a single SEC process
using a simple five-file segmentation based on the first
letter of the event, with an 1800-rule ruleset, increased
throughput by a factor of three. On a fully optimized
ruleset of 50 example rules, running on a SunBlade 150
(128 MB of memory, 650 MHz), I have seen rates
exceeding 300 lines/sec with less than 40% processor
utilization. In tests run under the Cygwin environment
on Microsoft Windows 2000, 40 rules produced a
throughput of 115 log entries per second. This single-file
path of 40 rules is roughly equivalent to a segmented
ruleset of 17 files with 20 rules each for a total of 340
rules, with events equally distributed across the rulesets.
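
The equivalence above can be checked with simple arithmetic under one possible reading of it. The Python below assumes one guard rule per file and looks at the worst case, in which an event is tested against the guard of every file and then against every rule of the last file; these assumptions are illustrative and are not a description of how the measurement was made.

```python
def worst_case_tests(files, rules_per_file, guards_per_file=1):
    """Rules tested when the matching rule is the last rule of the last file."""
    return files * guards_per_file + rules_per_file

total_rules = 17 * 20
path_length = worst_case_tests(files=17, rules_per_file=20)
print(total_rules, path_length)
```

Under these assumptions the longest path through the 340 segmented rules is 37 rule tests, which is close to the 40-rule single-file path.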


Note that these throughput numbers depend on
the event distribution, the length of the events, etc.
Your mileage may vary.


Parallelization of Rule Processing 


In addition to optimizing the rules, multiple SEC 
processes can be run, feeding their composite events 
to a parent SEC. SEC can watch multiple input 
streams. It merges all these streams into a single 
stream for analysis. This merging can interfere with 
recognition of multi-line events as well as acting to 
increase the size of an event queue, slowing down the 
effective throughput rate of a single event stream. 
Running a child SEC process on an event stream 
allows faster response to that stream. 


SEC’s spawn action creates a process and creates 
an event from every line emitted by the child process. 
The events from these child processes are placed on 
the front of the event queue for faster processing. 
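
Placing spawned events at the front of the queue can be pictured with a double-ended queue. The Python below is only a model of the ordering described above; the class is hypothetical, and keeping the relative order of the child's lines is an assumption.

```python
from collections import deque

class EventQueue:
    """Input lines go to the back; lines from spawned processes go to the front."""

    def __init__(self):
        self._queue = deque()

    def add_input(self, line):
        self._queue.append(line)

    def add_spawned(self, lines):
        # extendleft reverses its argument, so reverse first to keep the order.
        self._queue.extendleft(reversed(list(lines)))

    def next_event(self):
        return self._queue.popleft() if self._queue else None

queue = EventQueue()
queue.add_input("raw line 1")
queue.add_input("raw line 2")
queue.add_spawned(["composite A", "composite B"])
order = [queue.next_event() for _ in range(4)]
print(order)
```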


These features allow the creation of a hierarchy 
of SEC processes to process multiple rules files. This 
reduces the burden on the parent SEC process by dis- 
tributing the total number of rules across different pro- 
cesses. In addition, it simplifies the creation of rules 
when multi-line events must be considered, by pre- 
venting the events from being distorted by the injection 
of other events in the middle of the multi-line event. 


SEC is not threaded, so use of concurrent pro- 
cesses is the way to make SEC utilize multiprocessor 
systems. However, even on uniprocessor systems, it 
seems to provide better throughput by reducing the 
mean number of rules that SEC has to try before find- 
ing a match. 


Distribution Across Nodes 


SEC has no built-in mechanism for distributing
events to, or receiving events from, other hosts. However,
one can be crafted using the ideas from the last two
sections. Although this has not been tested, it is expected
to provide a significant performance improvement.
The basic idea is to have the parent SEC process
use ssh to spawn child SEC processes on different 
nodes. These nodes have rules files that handle a por- 
tion of the event stream. The logging mechanisms are 
set up to split the event streams to the nodes so that 
each node has to work on only a portion of the event 
stream. Even if the logs are not split across nodes, the 
reduced number of rules on each node is expected to 
allow greater throughput. 


This can be used in a cluster to allow each host 
to process its own event streams and report composite 
events to the parent SEC process for cross-machine 
correlation operations. 


Limitations 


Like any tool, SEC is not without its limitations.
The serial nature of applying SEC’s rules limits its
throughput. Some form of tree-structured mechanism
for specifying the rules would allow faster application.
One idea that struck me as interesting is the use of
ripple-down rulesets for event correlation [Clark2000]
that could simplify the creation and maintenance of
rulesets as well as speed up execution of complex
correlation operations.


As can be seen above, a number of idioms con- 
sist of mating a single rule to a more complex correla- 
tion rule to receive the desired result. This makes it 
easy to get lost in the interactions of more complex 
rulesets. I think more research into commonly used 
idioms, and the generation of new correlation opera- 
tions to support these idioms will improve the read- 
ability and maintainability of the correlation rules. 


The power provided by the use of Perl regular
expressions is tempered by the inability to treat the
event as a series of fields rather than a single entity.
For example, I would prefer to parse the event line
into a series of named fields, and use the presence,
absence and content of those fields to make the
decisions on what rules were executed. I think it would be
more efficient and less error prone to come up with a
standard form for the event messages and allow SEC
to tie pattern matches to particular elements of the
event rather than match the entire event. However,
implementation of the mechanism may have to wait
for the “One True Standard for Event Reporting,” and
I do not believe I will live long enough to see that
become a reality.
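
What field-based matching might look like can be sketched with named groups. The Python below is an illustration, not a feature of SEC: the field names are invented, and the pattern assumes the message layout of the example events in Figure 13 (host, program[pid], a bracketed ID with facility.level, then the message).

```python
import re

# Assumed layout: "<host> <program>[<pid>]: [ID <msgid> <facility>.<level>] <message>"
SYSLOG_LINE = re.compile(
    r"^(?P<host>[A-Za-z0-9._-]+) "
    r"(?P<program>[A-Za-z0-9_-]+)\[(?P<pid>[0-9]+)\]: "
    r"\[ID (?P<msgid>[0-9]+) (?P<facility>[a-z0-9]+)\.(?P<level>[a-z]+)\] "
    r"(?P<message>.*)$"
)

def parse(line):
    """Split an event line into named fields, or return None if it does not fit."""
    match = SYSLOG_LINE.match(line)
    return match.groupdict() if match else None

def is_session_start(fields):
    """Decide on fields instead of re-matching the whole line."""
    return (fields is not None
            and fields["program"] == "sshd"
            and fields["message"].startswith("Connection from "))

fields = parse("example.org sshd[10240]: [ID 800047 auth.info] "
               "Connection from 192.168.0.1 port 3500")
print(fields)
print(is_session_start(fields))
```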


The choice of Perl as an implementation
language is a major plus because it is a more widely
known language than C among the audience for the
SEC tool; this increases the pool of contributors to the
application. Also, Perl allows much more rapid
development than C. However, using an interpreted
language (even one turned into highly optimized
bytecode) does cause a slowdown in execution speed
compared to a native executable.


SEC does not magically parse timestamps. Its
timing is based on the arrival time of the event. This
can be a problem in a large network if the travel time
cannot be neglected in the event correlation operations.


Future Directions 


Refinement of the available rule primitives and 
actions (e.g., the expire action) is an area for investi- 
gation. A number of idioms presented above are more 
difficult to use than I would like. In some cases these 
idioms could be made easier by adding new correla- 
tion types to the language. In other cases a mechanism 
for storing and retrieving redundant information (such 
as regular expressions and timing periods) will sim- 
plify the idioms. This may be external using a pre- 
processor such as filepp or m4, or may be an internal 
mechanism. 


Even though SEC development is ongoing, not 
every idea needs to be implemented in the core. Using 
available Perl modules and custom libraries it is possi- 
ble to create functions and routines to enhance the 
available functionality without making changes to the 
SEC core. Developing libraries of add-on routines, as
well as standard ways of loading and accessing these
routines, is an ongoing project. This form of extension
permits experimentation without bloating SEC’s core.


I would like to see some work done in formaliz- 
ing the concept of rule segmentation and improving 
the ability to branch within the rule sets to decrease 
the time spent searching for applicable rules. 


Availability 


SEC is available from http://kodu.neti.ee/risto/sec/ . 


In addition to the resources at the primary SEC
site above, a very good tutorial has been written by
Jim Brown [Brown2003] and is available at:
http://sixshooter.v6.thrupoint.net/SEC-examples/article.html .


An annotated collection of rules files is available
from http://www.cs.umb.edu/rouilj/sec/sec_rules-1.0.tgz.
This expands on the rules covered in this paper and
provides the tools for the performance testing as well
as a sample sshrc file for the ssh correlation example.


Conclusion 


SEC is a very flexible tool that allows many
complex correlations to be specified. Many of these
complex correlations can be used to model [Prewett] normal
and abnormal sequences of events. Precise modeling of
events reduces both the false positive and false negative
rates, easing the burden on system administrators.


The increased accuracy of the model provided by
SEC results in faster recognition of problems, leading
to reduced downtime, less stress, and higher, more
consistent service levels.


This paper has just scratched the surface of 
SEC’s capabilities. Refinements in rule idioms and 
linkage of SEC to databases are just a few of the 
future directions for this tool. Just as prior log analysis 
applications such as logsurfer influenced the design
and capabilities of SEC, I believe SEC will serve to 
foster research and push the envelope of current log 
analysis and event correlation. 
ABSTRACT


Automatic fault diagnosis is an important problem for system management. In this paper, we 
combine high level symptom descriptions and low level state information to solve the system fault 
diagnosis problem. We extract state-symptom correlation information from knowledge sources in 
text format, and then use symptom similarity to rank the candidate system states. We apply the 
method to Windows Registry problems to help Product Support Service (PSS) engineers. 
Promising results with two different knowledge sources show the robustness of our method. 
Finally, we explain why this combination is successful and also discuss its limitations. 


Introduction 


Configuration management will remain a persistent
problem “as long as people change how they want
to use the system” [Ande95]. Change and Configuration
Management and Support (CCMS) of computer
systems with large install bases and large numbers of
available third-party software packages has proved to
be a daunting task [LC01]. Jim Gray described the
Trouble-Free System as an important goal of IT research:
to build a system used by millions of people each day
and yet administered and managed by a single part-time
person [Gray03]. To achieve this goal, systems should be
self-managing. Redstone and coworkers [RSB03]
described a global-scale automated problem diagnosis
system that collects problem symptoms from users’
desktops and then automatically searches global
databases of problem symptoms and fixes. We address
similar problems in a new way in this paper.


People typically use two different strategies to
diagnose system faults: a symptom-based approach and
a state-based approach. Nowadays, many systems have
knowledge databases of their known problems online
(such as [Apple], [BugNet], [MSKB] and [Redhat]).
Computer users typically use symptom-based analysis
to troubleshoot configuration problems. They describe
their problems in words and use information
retrieval tools to find documents containing solutions
to the problem. Because most customers are
not PC experts, their problem descriptions are usually
inaccurate, and using them directly to retrieve relevant
documents often yields unsatisfactory results.


At the other extreme, many tools attempt to automate
the fault diagnosis task using low level machine
states (such as [CKF02] and [Qie03]). Such tools
usually provide a language to specify the expected
behavior of the system, use monitors to detect system
deviation from the rules, and define actions to correct
it. For example, Strider [W03] uses various techniques
to narrow down the list of candidate root
causes, including persistent state differencing, runtime
tracing, intersection, and statistical ranking. It then
uses configuration roll-back [SR00] to fix the problem.
Unfortunately, in many cases, the ranking results
are not satisfactory. Furthermore, its differencing step
and tracing step are not always feasible.


In this study, we combine high level symptom
descriptions and low level state information to
solve the configuration fault diagnosis problem. The
idea is to extract correlation information between low
level states and high level symptoms from knowledge
sources, and then use symptom similarity to rank the
states. We apply the method to Windows Registry
problems [Gan04]. Promising results with two different
knowledge sources show the robustness of our
method. Finally, we explain why the combination
is successful and discuss its limitations.


System Architecture and User Scenario in PSS 


The state-symptom correlation information
required for our problem solving technique is
extracted from various text-based knowledge sources,
such as the Product Support Service (PSS) log and the
Microsoft Knowledge Base (KB). The information is
then stored in a database called the PC-Genomics
Database. Figure 1 illustrates a scenario in which the
PC-Genomics technique is used for more effective
problem troubleshooting.


For example, a user named Diana cannot find any
fonts in the font dialog box. This is because the
registry keys that list TrueType fonts are damaged.
Steps 1 through 7 show how her problem is solved:

• Step 1: Diana reports the problem to PSS. She
goes to http://support.microsoft.com and describes
the problem in a short paragraph.

• Step 2: The PSS engineer initially tries to diagnose
the problem using the normal method. If
this works, go directly to step 7.

• Step 3: The state collection and analysis tools
are downloaded from the site to Diana’s
machine.

• Step 4: Diana runs Strider to compare bad
states and good states in Restore points; the
trace log is also produced.

• Step 5: A candidate set containing possible
incorrect states is generated from the collected
data and sent back to PSS.

• Step 6: The candidate set is fed into the
PC-Genomics database to determine the root cause
of the problem.

• Step 7: The generated solution is sent back to
Diana. It could be a solution script, an
executable, or related KB articles. In this case,
she receives a solution script which deletes the
key: hkey_local_machine\software\microsoft\windows nt\currentversion\fonts.


Extracting State-Symptom Correlation from Knowledge Sources

The PC-Genomics Database


Obtaining the specific data needed for the PC-Genomics
Database requires different techniques for
different data sources. The digital knowledge sources
we use usually consist of articles in free text form.
Although we are far from being able to understand the
meaning of these articles automatically, we can identify
state names and the portion of text that specifies
the problem symptom in these articles. This
state-symptom co-occurrence is crucial information for
linking states with their symptoms. Sometimes the
corresponding software name and resolution can also be
identified, and the data extracted for the PC-Genomics
Database then has the form shown in Table 1.


The Registry Dictionary 


The Windows Registry is the main configuration
state store on PCs. It has a tree-like structure, and each
piece of configuration state is specified by a path
name and optionally a value name. Either a path name
or a value name prefixed by its path name is called an
entry, and there are typically more than 200,000 registry
entries on a machine [SR]. To locate registry
entries within free format text, we first collected all
the registry entries from 50 PCs. Together they contained
898,546 unique registry entries after name canonicalization
(e.g., substituting different user IDs, like
“s-1-5-21-...”, with the string “UID” in the registry
entry path). These entries are then used as a “registry
dictionary” to help recognize registry entries within free
format text.
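
The canonicalization step can be illustrated with a minimal sketch. It assumes that a per-user path component looks like "s-1-5-21-" followed by dash-separated numbers and that entries are compared case-insensitively; the pattern, the lower-casing, and the function names are illustrative assumptions, not the tooling used in the study.

```python
import re

# Assumed shape of a per-user SID path component, e.g. "s-1-5-21-111-222-333-1001".
SID_COMPONENT = re.compile(r"^s-1-5-21(-\d+)+$")


def canonicalize(entry):
    """Lower-case a registry entry and replace per-user SID components with 'UID'."""
    parts = entry.strip().lower().split("\\")
    return "\\".join("UID" if SID_COMPONENT.match(p) else p for p in parts)


def build_dictionary(snapshots):
    """Union of canonicalized entries over all machine snapshots (hypothetical helper)."""
    dictionary = set()
    for entries in snapshots:
        dictionary.update(canonicalize(e) for e in entries)
    return dictionary
```

Entries from different machines that differ only in the user ID collapse to a single dictionary entry, which is why the number of unique entries is counted after canonicalization.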

The Knowledge Sources 

The PSS log is an archive of problem-solving
cases maintained by Microsoft Corporation. Each case
contains the emails exchanged between a customer
and a support engineer (see Figure 2). The total PSS
log body contains more than 10 million cases. We
used 2,311,492 cases in our experiment. These cases
cover 15 products within six product families, ranging
in time from 3/20/1997 to 5/13/2003 (see Table 2). We
combine the action and result parts of the summary as
the symptom of a case. If a registry key is referenced
in the final mail message of a case, this entry and the
case symptom are added to the PC-Genomics
Database as a pair. We found that 143,157 PSS log
cases referenced 4,837 unique registry entries, of which
1,913 are registry values.


Figure 1: Architecture of PC-Genomics troubleshooting.
[Diagram not reproduced. Labeled elements: User;
http://support.microsoft.com; PSS Log or KB Articles;
PC-Genomics Database; State Analysis; Restore Points;
Trace Log. Labeled flows: (1) Report Problem with Symptom;
(2) Regular Diagnosis; Download ActiveX; (4) Run Downloaded
Tools; (6) Candidate state set; (7) Solution.]


Columns: ID, State (Registry Key), Symptom, Software, Solution

State (Registry Key):  HKLM\Software\Policies\Microsoft\Messenger\Client
Symptom:               [illegible] Windows Messenger...
Software:              Messenger
Solution:              delete PreventRun register item

State (Registry Key):  HKLM\Software\classes\clsid\{f414-a00bb8}\inprocserver32
Symptom:               System Restore GUI is blank...
Software:              Windows
Solution:              1. Go to “Start>Run”...

Table 1: Format of the PC genomics database.


Contact: Dina
System: WIN98 win 98 4.10

Problem: All of my true type fonts have vanished from the
font dialog box

------------------------- mail -------------------------

Dear Dina,

There are two things that we need to check. First...
Sincerely,

Gary

------------------------- mail -------------------------

Dear Gary,

Dina

------------------------- mail -------------------------

Dear Dina,

I’m going to close your case as successfully solved. Thank you
for choosing Microsoft.

Sincerely,

Support Engineer

SUMMARY

<<Symptom>>

TrueType fonts may not be present in the Fonts folder.

<<Cause>>

The registry key that lists TrueType fonts may be damaged or
missing.

<<Resolution>>

Delete the Fonts key and then add it again under:

hkey_local_machine\software\microsoft\windows nt\currentversion

Figure 2: Sample emails in PSS log.









VB Enterprise 6.0 Win32 EN 
Q329134: Print or Edit Dialog Boxes May Not Appear in
Internet Explorer

The information in this article applies to:
Microsoft Internet Explorer version 6 for Windows 2000
Microsoft Internet Explorer 5.5 for Windows 2000 SP 2

SYMPTOMS

When you click Print or Print Preview on the File menu or
click Find on the Edit menu in Internet Explorer, the Print and
Edit dialog boxes do not appear.

CAUSE

This problem occurs if a corrupted value exists in the registry
that may have been written by a third-party installation
program.

RESOLUTION

1. Click Start, click Run, type regedit in the Open box, and
then click OK.

2. Locate and then click the following registry key:

HKEY_CLASSES_ROOT\CLSID\{00020420-0000-0000-C000-000000000046}\InprocServer32

3. In the right pane, right-click InprocServer32, and then
click Delete.


Figure 3: A Sample KB Article.


The Microsoft Knowledge Base [MSKB] contains
troubleshooting articles written by experienced engineers
(see Figure 3). Our data consist of 142,448 articles
ranging from Q10022 to Q332210. The KB articles
are written in well-formed XML, so it is easy to
parse their symptom section. A registry key found anywhere
in the article is added to the PC-Genomics
Database with the corresponding symptom. We found
that 1,921 of the KB articles reference 996 unique registry
entries, of which 412 are registry values.
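
As a rough illustration of how a registry dictionary can be used to pull state-symptom pairs out of free text, the sketch below looks for a hive prefix on each line and then shrinks the rest of the line until it matches a dictionary entry. The matching strategy, the hive pattern, and the function names are assumptions made for this example; the sketch also assumes lower-cased dictionary entries and does not handle keys that wrap across lines.

```python
import re

# Assumed hive prefix pattern, e.g. "HKEY_LOCAL_MACHINE\".
HIVE = re.compile(r"hkey_[a-z_]+\\", re.IGNORECASE)


def find_registry_entries(text, dictionary):
    """Return dictionary entries mentioned in text, using longest-prefix matching."""
    found = []
    for line in text.splitlines():
        for match in HIVE.finditer(line):
            candidate = line[match.start():].strip().lower()
            # Shrink the candidate from the right until it is a known entry.
            while candidate and candidate not in dictionary:
                candidate = candidate[:-1].rstrip()
            if candidate:
                found.append(candidate)
    return found


def extract_pairs(symptom, final_mail, dictionary):
    """Pair every registry entry found in the final mail with the case symptom."""
    return [(entry, symptom) for entry in find_registry_entries(final_mail, dictionary)]
```

Because registry paths may contain spaces (for example "windows nt"), the sketch relies on the dictionary rather than on whitespace to decide where an entry ends.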


Rank States Using State-Symptom Correlations 
Symptom Similarity 
Table 2: The PSS log used as a knowledge source.
[Table body not recoverable from the source; the only
surviving cell is “338,379”.]


A state-based tool, like Strider, can generate a set
of states as candidates for the root cause of a given
problem. Our approach matches the symptom of the
problem against the symptoms recorded for each candidate
state in the PC-Genomics database. In this way, we can
estimate the similarity between the current problem and
previously recorded problems (see Figure 4). We
employ the traditional tf-idf and cosine [V79] measures
from information retrieval to calculate similarity values.
In the database, all the symptoms of a root cause
are combined into a mixed symptom
$S_i = S_{i1} + S_{i2} + \cdots + S_{in}$.
The current symptom $S_{current}$ and each of
the recorded symptoms $S_i$ are represented as vectors of
term frequencies. The basic assumption here is that if a
state is a good candidate for the current problem, it is
highly likely that it caused problems with similar
symptoms in the past. The calculated similarity
values are used to rank the candidate registry entries.
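
The ranking step can be sketched as follows. This is a minimal illustration under stated assumptions: the tokenizer, the smoothed idf formula, and the function names are choices made for the example, since the exact weighting used in the study is not given here. The database is assumed to map each state to the list of symptom strings recorded for it.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split text into lower-case alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(u, v):
    """Cosine similarity between two sparse term-weight dictionaries."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_candidates(current_symptom, candidates, database):
    """Rank candidate states by similarity of their mixed symptom to the current one."""
    # Mixed symptom S_i: all recorded symptoms of a state merged into one bag of terms.
    mixed = {state: Counter(tokenize(" ".join(symptoms)))
             for state, symptoms in database.items()}
    n_docs = len(mixed)
    doc_freq = Counter()
    for term_counts in mixed.values():
        doc_freq.update(term_counts.keys())

    def tfidf(term_counts):
        # Smoothed idf; an assumption, not necessarily the paper's weighting.
        return {t: c * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1.0)
                for t, c in term_counts.items()}

    query = tfidf(Counter(tokenize(current_symptom)))
    scored = [(state, cosine(query, tfidf(mixed[state])) if state in mixed else 0.0)
              for state in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Candidates that never appear in the database receive a similarity of zero and therefore sink to the bottom of the ranking.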


Intersection-Ranking, Diff-Ranking & Trace-Ranking


We collected 74 registry-related real-world problems
reported by our colleagues and users of web support
forums. These problems are independent of the
PC-Genomics Database we are building. For each problem,
we recorded its symptom description, trace set,
and differencing set. We also calculated the intersection
set and root cause rankings with the Strider tool. There
are three points at which our ranking scheme can be
applied. First, we can rank the intersection set, which is
called intersection-ranking. This ranking is expected to
produce better results, since the intersection is
relatively small and the ranking cost is therefore also
small. Second, when no trace data is available, we can
rank the differencing set directly, which is called
diff-ranking. Finally, when no differencing
data is available, we can rank the trace set directly,
which is called trace-ranking.
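
The choice among the three ranking points can be written down as a small selection function. The function name and return convention are hypothetical and serve only to make the rule explicit.

```python
def choose_candidates(trace_set=None, diff_set=None):
    """Pick the candidate set to rank, depending on which data is available."""
    if trace_set and diff_set:
        # Both available: rank the (smaller) intersection.
        return "intersection-ranking", set(trace_set) & set(diff_set)
    if diff_set:
        return "diff-ranking", set(diff_set)
    if trace_set:
        return "trace-ranking", set(trace_set)
    return "none", set()
```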

Relaxed Root Cause Matching in the Database 

If both the value name and path name of a root
cause can be matched in the PC-Genomics database,
we call it value-matched. If only the path name is
matched, we call it path-matched. Otherwise, we call
it not-matched. For example, the configuration state
with path name “hkey_classes_root\.jpg” and value
name “(Default)” is considered value-matched if
“hkey_classes_root\.jpg\(Default)” can be found in
the database. Otherwise, if “hkey_classes_root\.jpg” is in
the database, it is path-matched. If neither is in the
database, the configuration state is considered
not-matched. In our experience, this relaxed matching
criterion increases the problem coverage of our
method and does little harm to its accuracy.


[Figure 4 label: Candidate Root Causes and Their Related Symptoms]


Lao, Wen, Ma, and Wang


Because the “registry dictionary” covers only 66
of the 74 root causes, the remaining eight root causes
could not be recognized in the free text of the knowledge
sources. The PC-Genomics database extracted from the PSS
log covers the root causes of 59 problems, while the
database from KB covers 37.
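
The value-matched, path-matched, and not-matched distinction described above can be sketched as a lookup with a fallback. The function name is hypothetical, and the sketch assumes that database entries are stored in lower case.

```python
def match_level(path, value, database):
    """Classify a root cause as value-matched, path-matched, or not-matched."""
    path = path.lower()
    if value is not None and path + "\\" + value.lower() in database:
        return "value-matched"
    if path in database:
        # Relaxed match: only the path name is known to the database.
        return "path-matched"
    return "not-matched"
```

For instance, with a database containing only "hkey_classes_root\.jpg", the state with that path and value name "(Default)" falls back to a path match.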


Ranking Result 


The result of a diagnosis process is a
ranking of candidate root causes, ordered by their
likelihood of being the actual one. The real root
cause should be ranked as high as possible for
this approach to be effective. Usually, a ranking less
than 5 is preferred. For each method, we sorted the
cases by the ranking of their actual root causes. With
these ranking curves, we can easily compare the
diagnostic effectiveness of different methods.
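
A ranking curve of this kind can be computed with a few lines of code. The sketch below assumes each case is given as a ranked list of candidate states together with the actual root cause; the function names and the cutoff parameter are illustrative.

```python
def ranking_curve(cases):
    """Return the sorted 1-based ranks of the actual root causes.

    cases: list of (ranked_states, actual_root_cause) pairs. Cases whose
    root cause is absent from the ranked list are skipped.
    """
    ranks = []
    for ranked_states, root_cause in cases:
        if root_cause in ranked_states:
            ranks.append(ranked_states.index(root_cause) + 1)
    return sorted(ranks)


def count_within(ranks, cutoff):
    """Number of cases whose root cause is ranked at or above the cutoff."""
    return sum(1 for rank in ranks if rank <= cutoff)
```

Plotting the sorted ranks against the case index gives one curve per method, so a method whose curve stays low for more cases is the more effective one.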


Figure 5 contains the ranking curves using the PSS
log. We can see that our method effectively increases
the diagnosis accuracy with intersection data. Only
nine of the 59 real root causes have a rank greater
than 5. However, ranking with only differencing data or
trace data is not very accurate.


Figure 6 contains the ranking curves using KB.
The ranking curves show that our knowledge database
from KB articles still gives good accuracy with
intersection data and differencing data, but the ranking
with trace data deteriorates considerably.


Figure 4: Symptom similarity for root cause ranking. 
Discussion


In this section, we discuss the strengths and 
weaknesses of the method. 


One-to-many Mappings Between Symptoms and States


Problems with the same or similar symptoms
may be caused by different registry entries. For
instance, we manually checked the PSS log and found
that 17 entries have been reported to cause the “Cannot
open Word document” problem (see Table 3). If
we use only symptom-based methods for troubleshooting,
we get multiple possible root causes
for a problem. In our approach, state information acts
as a filter that is orthogonal to the symptom description
and can pinpoint the root cause efficiently.


Problem Coverage 


The effectiveness of our approach depends on the
problem coverage of our database. We need two things
to achieve this goal: a “registry dictionary”
with good problem coverage, and a knowledge source
with good problem coverage from which to extract
information.


In the PSS log data, only about 0.5% of registry entries
are ever reported to cause problems (i.e., 4,837 of
898,546). In the KB data, only about 0.1% of entries are
ever reported (i.e., 996 of 898,546). If we treat these
entries as a filter, they are very effective at scaling
down the candidate root cause set.


Among the 4,837 fragile registry entries from the
PSS log, only a few cause problems frequently.
Most entries have small numbers of occurrences (see
Figure 7), and the counts approximately follow a Zipf
distribution [Z49]. Even if the PC-Genomics Database
contains only a portion of the known problems, it can still
greatly reduce real-world enterprise support cost
because the most costly problems are well covered.


Since the state-symptom database is flexible, the
coverage problem can be further alleviated. If we find
a common problem outside the dictionary or knowledge
source, we can simply add its symptom and state
to the database manually. However, because we can only
handle problems whose root-cause entries have
been found before, Strider is still needed to find
those root causes the first time.


Figure 5: PSS log ranking results.
[Plot not reproduced. Y axis: Rank; X axis: Case Ordered by
Their Rank; curves: Trace-Ranking, Strider, Diff-Ranking,
Intersection-Ranking.]


Figure 6: KB ranking results.
[Plot not reproduced. Y axis: Rank; X axis: Case Ordered by
Their Rank; curves: Trace-Ranking, Strider-Ranking,
Diff-Ranking, Intersection-Ranking.]


The Percentage of Registry-Related Problems 


Directly calculating the percentage of registry-related
problems from the number of cases that
cite a registry key yields a very small number: 6.2%
(i.e., 143,157 of 2,311,492). One reason is
that the engineers often cite a registry-related KB article
and ask the user to do what the article says without
specifying the registry key name. Another major reason
is that finding the root causes of Registry problems was
extremely hard before Strider.


With the help of the KB information for each case,
we can get a better estimate. When a case is closed,
the engineer is required to cite a KB article ID as the
description of the case. About 8.8% of KB articles
(i.e., 12,464 of 142,448) contain the keyword “registry.”
These “registry” KB articles cover 27% of the
cases that cite a KB article. We manually verified 40 of
these KB articles and found that 15 of them are not
actually registry problems. They may just provide
registry-related knowledge or use the registry as a
problem solving method. Thus, the overall registry
problem percentage is approximately 17% (i.e.,
27% × 25/40).
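
The percentages quoted in this section can be reproduced from the reported counts. The short sketch below uses only numbers stated in the text; the function names are illustrative.

```python
def percent(part, whole):
    """Express part/whole as a percentage."""
    return 100.0 * part / whole


def estimated_registry_share(kb_share_pct, verified, not_registry):
    """Scale the share of cases citing 'registry' KB articles by the
    fraction of manually verified articles that were real registry problems."""
    return kb_share_pct * (verified - not_registry) / verified


# Counts taken from the text above.
cases_citing_key = percent(143157, 2311492)      # direct estimate, about 6.2%
registry_kb_articles = percent(12464, 142448)    # about 8.75%, quoted as 8.8%
overall = estimated_registry_share(27.0, 40, 15)  # 27% x 25/40, about 17%
```

The correction factor 25/40 is the fraction of the 40 manually verified articles that described genuine registry problems.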


Summary 


We have proposed a novel solution that combines
the traditional symptom-based troubleshooting method
with the relatively new state-based troubleshooting
method. It adds some data collection overhead for
the user, but it can solve some previously hard-to-solve
problems. We therefore prefer to treat the PC-Genomics
technique as a backup for the regular methods
and to use it only when the regular methods fail. Our
future work will exploit other types of system information
that can give the troubleshooting process
better problem coverage and make it more automatic.


hkey_users\.default\Software\Microsoft\Office\8.0\Outlook\Options\Mail
hkey_current_user\software\microsoft\office\9.0\word\data\toolbars
hkey_current_user\software\microsoft\office\9.0\word\data\settings
hkey_current_user\software\microsoft\office\10.0\word\data\settings
hkey_current_user\software\microsoft\office\10.0\word\data\toolbars
hkey_local_machine\SOFTWARE\Microsoft\Internet Explorer\Plugin
hkey_current_user\environment
hkey_local_machine\System\CurrentControlSet\Services\Inetinfo\Parameters\MIMEMap
hkey_classes_root\.doc\Content Type
hkey_classes_root\mime\database\content type
hkey_classes_root\MIME\DATABASE\Charset
hkey_classes_root\MIME\DATABASE\Codepage
hkey_classes_root\word.document
hkey_classes_root\excel.sheet.8\shell\open\command
hkey_current_user\software\microsoft\office\9.0\common\general\startup
hkey_local_machine\software\microsoft\shared tools\text converters\import
hkey_local_machine\software\microsoft\shared tools\text converters\import\wordperfect6\&x
hkey_classes_root\excel.sheet.8\shell\open\ddeexec
hkey_current_user\software\microsoft\shared tools\outlook\journaling\microsoft word\autojournaled

Table 3: Registry key root causes of the “Cannot open Word document” problem.


Figure 7: Occurrence of 1,913 value-matched registry entries in the PSS Log.


Acknowledgement 


We would like to express our sincere thanks to our
shepherd Æleen Frisch for her valuable feedback, to
Aaron Johnson, Chad Verbowski, and Archana Ganapathi
for their analysis of the test cases, and to the many
colleagues who helped collect the registry snapshots
from 50 computers.


Author Information 


Ni Lao is currently a graduate student in the School
of Software at Tsinghua University in China. He
received his B.S. in Electronic Engineering from
Tsinghua University in 2003. His current research
focuses on automated system management using data
mining and pattern recognition methods. He can be
reached at noon99@mails.tsinghua.edu.cn.


Ji-Rong Wen is a researcher at Microsoft
Research Asia. He received his Ph.D. in Computer
Science from the Institute of Computing Technology,
the Chinese Academy of Science, in 1999. He joined
Microsoft in July 1999. His current research interests
are data management, information retrieval (especially
Web search), data mining, and system management.


Wei-Ying Ma received the B.S. degree in electrical
engineering from the National Tsing Hua University
in Taiwan in 1990, and the M.S. and Ph.D. degrees
in electrical and computer engineering from the
University of California at Santa Barbara in 1994 and
1997, respectively. He joined Microsoft Research Asia
in April 2001 as the Research Manager of the Information
Management and Systems Group. Prior to joining
Microsoft, he was with Hewlett-Packard Laboratories
at Palo Alto. From 1994 to 1997 he was engaged in the
Alexandria Digital Library (ADL) project at the
University of California at Santa Barbara while completing
his Ph.D. Dr. Wei-Ying Ma serves as an Editor for the
ACM Multimedia System Journal and Associate Editor
for the Journal of Multimedia Tools and Applications
published by Kluwer Academic Publishers. His
research interests include Internet search, information
retrieval, content-based image retrieval, intelligent
information systems, adaptive content delivery, and
media distribution and services networks.


Yi-Min Wang is the manager of the Systems
Management Research Group at Microsoft Research,
Redmond. He received his Ph.D. in Electrical and
Computer Engineering from the University of Illinois at
Urbana-Champaign in 1993, worked at AT&T Bell
Labs from 1993 to 1997, and joined Microsoft in
1998. His research interests include systems and security
management, fault tolerance, home networking,
and distributed systems.


2004 LISA XVIII — November 14-19, 2004 — Atlanta, GA 


Combining High Level Symptom Descriptions ... 


References 


[Ande95] Anderson, E. and D. Patterson, “A Retrospective
on Twelve Years of LISA Proceedings,”
Proceedings of the Thirteenth Systems Administration
Conference (LISA XIII), USENIX, p. 95, 1999.

[Apple] Knowledge Base, http://kbase.info.apple.com.

[BugNet] BugNet, BugNet, http://www.bugnet.com.

[CKF02] Chen, M., E. Kiciman, E. Fratkin, A. Fox,
and E. Brewer, “Pinpoint: Problem Determination
in Large, Dynamic, Internet Services,” Proc.
Int. Conf. on Dependable Systems and Networks
(IPDS Track), 2002.

[Gan04] Ganapathi, A., Yi-Min Wang, Ni Lao, and
Ji-Rong Wen, “Why PCs Are Fragile and What We
Can Do About It: A Study of Windows Registry
Problems,” to appear in Proc. IEEE DSN/DCC,
June 2004.

[Gray03] Gray, J., “What Next? A Dozen Information-Technology
Research Goals,” Journal of the
ACM, Vol. 50, Num. 1, pp. 41-57, January 2003.

[LC01] Larsson, M. and I. Crnkovic, “Configuration
Management for Component-based Systems,”
Proc. Int. Conf. on Software Engineering (ICSE),
May 2001.

[MSKB] Microsoft Corporation, Microsoft Knowledge
Base, http://support.microsoft.com.

[Qie03] Qie, X.-H. and Sanjai Narain, “Using Service
Grammar to Diagnose BGP Configuration
Errors,” Proc. Usenix Large Installation Systems
Administration (LISA) Conference, pp. 237-246,
October 2003.

[Redhat] Redhat Corporation, Redhat Support Forums,
http://www.redhat.com/support/knowledgebase/forums/.

[RSB03] Redstone, J. A., M. M. Swift, and B. N. Bershad,
“Using Computers to Diagnose Computer Problems,”
Proc. HotOS, 2003.

[SR] Windows XP System Restore,
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dnwxp/html/windowsxpsystemrestore.asp.

[SR00] Solomon, D. A. and M. Russinovich, Inside
Microsoft Windows 2000, Microsoft Press, Third
Edition, Sept 2000.

[V79] van Rijsbergen, C. J., Information Retrieval,
Butterworths, Second Edition, London, 1979.

[W03] Wang, Y.-M., C. Verbowski, J. Dunagan,
Y. Chen, H. J. Wang, C. Yuan, and Z. Zhang,
“STRIDER: A Black-box, State-based Approach
to Change and Configuration Management and
Support,” Proc. Usenix Large Installation Systems
Administration (LISA) Conference, pp.
159-171, October 2003.

[Z49] Zipf, G. K., Human Behavior and the Principle of
Least Effort: An Introduction to Human Ecology,
Addison Wesley, Cambridge, MA, 1949.







LifeBoat — An Autonomic 
Backup and Restore Solution 


Ted Bonkenburg, Dejan Diklic, Benjamin Reed, Mark Smith — IBM Almaden Research Center 


Michael Vanover — IBM PCD 
Steve Welch, Roger Williams — IBM Almaden Research Center


ABSTRACT 


We present an innovative backup and restore solution, called LifeBoat, for Windows
machines. Our solution provides for local and remote backups and “bare metal” restores. Classic
backup systems either do a file system backup, which requires the machine to be installed before
the system can be restored, or do a block-for-block backup of the system image, which allows for a
“bare metal” restore but makes it hard to access individual files in the backup. Our solution does
a file system backup while still allowing the system to be completely restored onto a new hard drive.


Windows presents some particularly difficult problems during both backup and restore. We
describe the information we store during backup to enable the “bare metal” restore. We also
describe some of the problems we ran into and how we overcame them. Not only do we provide a
way to restore a machine, but we also describe the rescue environment which allows machine
diagnostics and recovery of files that were not backed up. This paper presents an autonomic
workgroup backup solution called LifeBoat that increases the “Built-In Value” of the PC without
adding hardware, administrative cost, or complexity. LifeBoat applies autonomic principles to the
age-old problem of data backup and recovery.


Introduction 


Supporting PC clients currently represents 
roughly 50% of overall IT cost (IGS 2001). This num- 
ber is larger than both server (30%) and network 
related costs (20%). This provides the motivation for 
an autonomic approach to reducing the cost of PC 
clients. So far, thin clients have repeatedly failed in the 
marketplace. IT attempts to “lock down” PC clients 
have not been accepted. In addition, attempts to con- 
trol the client from the server have failed due to the 
fact that clients sometimes get disconnected. Fat 
clients, however, continue to prosper and increase in 
complexity which drives the maintenance cost up. We 
believe autonomic clients are critical components of 
an overall autonomic computing infrastructure. They 
will help lower the overall cost of ownership and 
reduce the client down time for corporations. 


The secure autonomic workgroup backup and
recovery system, LifeBoat, provides data recovery and
reliability to a workgroup while reducing administrative
costs for Windows 2000/XP machines. LifeBoat
provides a comprehensive backup solution including
backing up data across the peer workstations of a
workgroup, centralized server backup, and local
backup for disconnected operation. In addition, it provides
a complete rescue and recovery environment
which allows end users to easily and conveniently
restore downed machines. The LifeBoat project
increases the Built-in Value of the PC without adding
hardware, administrative, or complexity costs. By
leveraging several autonomic technologies, the LifeBoat
project increases utility while reducing administrative
cost.


In this paper we first describe the backup portion 
of LifeBoat. This is split into two sections, the first of 
which focuses on network backup. LifeBoat leverages 
a research technology called StorageNet to seamlessly 
spread backup data across the workstations of a work- 
group in a peer-to-peer fashion. We then create a scal- 
able road map from workgroup peer-to-peer to a cen- 
trally managed IT solution. The second backup section 
focuses on backing up to locally attached devices 
which is a requirement for disconnected operation. We 
then describe the complete rescue and recovery 
process which simplifies recovery of files and directo- 
ries as well as providing disaster recovery from total 
disk failure. Next we describe a centralized manage- 
ment approach for LifeBoat and how LifeBoat can fit 
within a corporate environment. We conclude with 
performance measurements of some example backup 
and restore operations. 


Backup 


LifeBoat supports a number of backup targets 
such as network peers, a dedicated network server, and 
locally attached storage devices. The Autonomic 
Backup Program is responsible for creating a backup 
copy of a user’s file system in such a way as to be able 
to completely restore the system to its original operat- 
ing state. This means that the backup must include file 
data as well as file metadata such as file times, ACL 
information, ownership, and attributes. Our backups
are performed file-wise so that users can restore or
recover individual files without restoring the entire
machine, as a block-based solution would require.
Finally, the backups are compressed on the fly in
order to save space.


The Autonomic Backup Program performs a 
backup by doing a depth first traversal of the user’s 
file system. As it comes to each file or directory, it 
creates a corresponding file in the backup and saves 
metadata information for the file in a separate file 
called attributes.ntfs which maintains the attributes for 
all files backed up. 
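
The traversal described above can be sketched as follows. This is a minimal illustration, not the Autonomic Backup Program itself: the function name `backup_tree` and the JSON side file `attributes.json` are hypothetical stand-ins (the paper's attributes.ntfs holds NTFS-specific metadata), and compression and locked-file handling are left out.

```python
import json
import os
import shutil
import stat


def backup_tree(source_root, backup_root, attributes_name="attributes.json"):
    """Depth-first, file-wise copy of source_root into backup_root.

    Metadata for every entry is collected in one separate side file,
    loosely mirroring the role of attributes.ntfs (hypothetical format).
    """
    records = []

    def visit(src_dir, dst_dir):
        os.makedirs(dst_dir, exist_ok=True)
        for entry in sorted(os.scandir(src_dir), key=lambda e: e.name):
            dst = os.path.join(dst_dir, entry.name)
            st = entry.stat(follow_symlinks=False)
            is_dir = entry.is_dir(follow_symlinks=False)
            records.append({
                "path": os.path.relpath(entry.path, source_root),
                "mode": stat.S_IMODE(st.st_mode),
                "mtime": st.st_mtime,
                "size": st.st_size,
                "is_dir": is_dir,
            })
            if is_dir:
                visit(entry.path, dst)  # descend before the next sibling
            elif entry.is_file(follow_symlinks=False):
                shutil.copyfile(entry.path, dst)

    visit(source_root, backup_root)
    with open(os.path.join(backup_root, attributes_name), "w") as f:
        json.dump(records, f, indent=2)
    return records
```

Keeping the metadata in one side file means the backup copies themselves can live on a file system that cannot represent the source attributes.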


There is special processing required for open 
files locked by the OS. The backup client employs a 
kernel driver to obtain file handles for reading these 
locked files. This driver stays resident only for the 
duration of the backup. When a backup is completed, 
the client generates a metadata file to describe the file 
systems which have been backed up. This “usage” 
file contains partition, file system, drive lettering, and 
disk space information. The existence of the usage file 
indicates that the backup was successful. 


The output of the backup and the format of the
backup data depend on the target. For a network
backup, the data is stored using a distributed file sys- 
tem known as StorageNet. StorageNet has some 
unique features which make it especially suited for our 
peer-to-peer and client-server backup solutions. For a 
backup to locally attached storage, the backup is 
stored in a Zip64 archive. 


StorageNet Overview 


The storage building block of our distributed file 
system, StorageNet, is an object storage device called 
SCARED [2] that organizes local storage into a flat 
namespace of objects identified by a 128-bit object id. 
A workstation becomes an object storage device when 
it runs the daemon to share some of its local storage 
with its peers. While the object disks we describe here 
are similar to other object based storage devices [2, 3, 
4, 8], our model has much richer semantics to allow it 
to run in a peer-to-peer environment. 


Clients request the creation of objects on 
SCARED devices. When an object is created, the 
device chooses an object id to identify the newly cre- 
ated object, marks the object as owned by the peer 
requesting creation, allocates space for it, and returns 
the object id to the client. Clients then use the object 
id as a handle to request operations to query, modify, 
and delete the object. 


An object consists of data, an access control list 
(ACL), and an info block. ACLs are enforced by the 
server so that only authorized clients access the 
objects. The info block is a variable sized attribute
associated with each object; the client reads and
writes it atomically. The info block is not interpreted
by the storage device.


One special kind of object creation useful in
backup applications is the linked creation of an object.
We implement hard links by passing the object id of 
an existing object when requesting creation. A hard 
link shares the data and ACL of the linked object, but 
has its own info block. These linked objects allow us 
to not only create hard links to files, but also to direc- 
tories. Hard linked objects are not deleted until the last 
hard link to the object is deleted. 
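
To make the object semantics concrete, here is a small in-memory model of a single storage device. It is a sketch under stated assumptions: the class `ObjectStore` and its method names are hypothetical, ids are random 128-bit integers, and authentication, quotas, and persistence are omitted. It shows that a linked object shares data and ACL with its target, keeps its own info block, and that the data is freed only when the last link is deleted.

```python
import os


class ObjectStore:
    """Toy in-memory stand-in for one object storage device (hypothetical API)."""

    def __init__(self):
        self._bodies = {}   # body id -> shared data, ACL, and hard-link count
        self._objects = {}  # object id -> body id plus a private info block

    @staticmethod
    def _new_id():
        return int.from_bytes(os.urandom(16), "big")  # random 128-bit id

    def _body(self, client, oid):
        body = self._bodies[self._objects[oid]["body"]]
        if client not in body["acl"]:
            raise PermissionError(f"{client} may not access object {oid:x}")
        return body

    def create(self, owner, link_to=None):
        if link_to is None:
            body_id = self._new_id()
            self._bodies[body_id] = {"data": b"", "acl": {owner}, "links": 0}
        else:
            self._body(owner, link_to)  # ACL check on the existing object
            body_id = self._objects[link_to]["body"]
        self._bodies[body_id]["links"] += 1
        oid = self._new_id()
        self._objects[oid] = {"body": body_id, "info": b""}
        return oid

    def write(self, client, oid, data):
        self._body(client, oid)["data"] = bytes(data)

    def read(self, client, oid):
        return self._body(client, oid)["data"]

    def set_info(self, client, oid, info):
        self._body(client, oid)  # ACL check only
        self._objects[oid]["info"] = bytes(info)

    def get_info(self, client, oid):
        self._body(client, oid)  # ACL check only
        return self._objects[oid]["info"]

    def delete(self, client, oid):
        body = self._body(client, oid)
        body_id = self._objects.pop(oid)["body"]
        body["links"] -= 1
        if body["links"] == 0:
            del self._bodies[body_id]  # last hard link gone: free the data
```

Because the device never interprets data or info blocks, the same model works whether an object represents a file or a directory.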


There are two kinds of objects stored on 
SCARED devices. File objects have semantics similar 
to local files. They are a stream of bytes that can be 
read, written, and truncated. 


Directory objects are the other kind of object; 
they are an array of variable sized entries. Entries are 
identified by a unique 128-bit number, the etag, set by 
the daemon as well as a unique 128-bit number, the
ltag, set by the client. The client chooses an ltag by
hashing the name of the file or directory represented 
by an entry. The entry also has variable sized data 
associated with it that can be read and set atomically. 


Later we will describe how these objects are 
used to build a distributed file system, but here we 
need to point out that the storage devices only manage 
the allocation and access to the objects they store. 
They do not interpret the data in those objects, and 
thus, do not know the relationships between objects or 
know how the objects are positioned in the file system 
hierarchy. Because the data stored on the storage 
devices is not interpreted, the data can be encrypted at 
the client and stored encrypted on the storage devices. 


SCARED devices also track the allocations of 
objects for a given peer to enforce quotas. Later we will 
explain why quota support is needed, but for now it is 
important to note this requirement on the storage devices. 


Along with object management, storage devices 
also authenticate clients that access them. All commu- 
nication is done using a protocol that provides mutual 
authentication and allows identification of the client 
and enforcement of quotas and access control. Note 
that communication only occurs between the client 
and storage device; storage devices are never required 
to communicate with each other. 


Clients use data stored on the object storage 
devices to create a distributed file system. The clients 
use meta-data attached to each object and directory 
objects to construct the file system. The directory 
entries are used to construct the file system hierarchy 
and the info blocks are used to verify integrity. 


Figure 1 shows the layout of the directory entries
as interpreted by the client. The first three fields are
maintained by the storage device. The other fields are
stored in the entry data and thus stored opaquely by
the storage device. The client needs to store the filename
in the entry data: the ltag is a hash of the filename,
which is useful for directory lookups, but the actual
filename is needed when doing directory listings. The
other important piece of information is the location of
the object represented by the entry or, if the entry is a
symbolic link, the string representing the symbolic link,
which is stored in the entry data.
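
The split between hashed lookups and stored filenames can be illustrated with a short sketch. The names `ltag_for` and `DirectoryObject` are hypothetical, and the choice of BLAKE2b truncated to 128 bits is an assumption; the text says only that the ltag is a hash of the name.

```python
import hashlib
import os


def ltag_for(name):
    """Derive a 128-bit lookup tag from an entry name (hash choice assumed)."""
    digest = hashlib.blake2b(name.encode("utf-8"), digest_size=16).digest()
    return int.from_bytes(digest, "big")


class DirectoryObject:
    """Toy directory object: entries are found by ltag, named by entry data."""

    def __init__(self):
        self._entries = {}  # ltag -> (etag, opaque entry data)

    def add(self, name, location):
        etag = int.from_bytes(os.urandom(16), "big")  # chosen by the daemon
        # The filename must be kept in the entry data: the ltag cannot be
        # reversed to recover it.
        entry_data = {"filename": name, "location": location}
        self._entries[ltag_for(name)] = (etag, entry_data)
        return etag

    def lookup(self, name):
        hit = self._entries.get(ltag_for(name))
        return None if hit is None else hit[1]

    def listing(self):
        return sorted(data["filename"] for _, data in self._entries.values())
```

A lookup touches only the hashed key, while a listing has to read each entry's data, which is why both pieces of information are stored.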


Figure 1: Layout of the directory entry (field labels recoverable from the original figure: Version, Filename).









Figure 2 shows a fragment of the distributed file 
system constructed using the structures outlined above. 
The first device contains two objects. The first object is 
a directory with two entries. The first entry represents a 
directory stored on the second device. The second direc- 
tory entry is a file that is stored on the same device. 


Peer-to-Peer Network Backup 


In the peer-to-peer case, our system backs up 
workstation data onto other workstations in the work- 
group. This is accomplished by defining a hidden par- 
tition on each workstation that can be used as a target 
of the backup. The architecture of the software compo- 
nents in the system is completely symmetric. Each 
workstation runs a copy of the client and the Stora- 
geNet server. In this way each station serves as both a 
backup source and target. In addition, each station runs
a copy of the LifeBoat agent process. This process runs
continuously and serves the web user interface that
constitutes the policy tool, allowing the user to change
backup targets, select files for backup, and set
scheduling times. At the appointed time, this process
also invokes the backup client program. The
hidden partition is created during the installation
process and is completely managed by the StorageNet
server on each station. The customer uses the client
software to specify what data to back up and on what
schedule. The target of the backup is determined by the
system and can be changed by the customer on request.


In the case of an incremental backup, our StorageNet
distributed file system offers some very strong
advantages over traditional network file systems. For
example, one feature which we use a great deal is the
ability to create directory hard links. If an
entire subtree of the file system remains unchanged
between a base and an incremental backup, we can
simply hard link the entire subtree to the corresponding
subtree in the base backup. When individual files
remain unchanged, but their siblings do not, we can
hard link to the individual files, and create new
backup files in the directory. This unique directory and
file hard linking ability allows each backup in our file
system, both base backup and incremental backups, to
look like an entire mirror image of the file system on
the machine being backed up. Each incremental
backup takes up only as much space as the data
that has changed between backups.
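
The hard-linking idea can be modelled in a few lines. This is an in-memory sketch, not StorageNet code: a directory is a dict, a file is a bytes value, and "hard linking" is represented by reusing the very same Python object from the base snapshot. The function name `incremental` is hypothetical, and deep equality stands in for the time and size comparison a real traversal would use.

```python
def incremental(base, current):
    """Build a full-looking snapshot of `current` that shares unchanged parts
    with `base`. Returns (snapshot, bytes_newly_stored)."""
    if base is not None and current == base:
        return base, 0  # unchanged file or whole subtree: link to the base
    if isinstance(current, bytes):
        return current, len(current)  # new or modified file: store a copy
    base_children = base if isinstance(base, dict) else {}
    snapshot, new_bytes = {}, 0
    for name, child in current.items():
        linked, cost = incremental(base_children.get(name), child)
        snapshot[name] = linked
        new_bytes += cost
    return snapshot, new_bytes
```

The resulting snapshot compares equal to the live tree, so it reads like a complete mirror, while the returned byte count reflects only the data that changed.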


In the local case, incremental backups look like a 
subset of the file system. Pieces of the file system that 
did not change are simply not copied into the zip. In 
order to distinguish between files that are unchanged 
and files that have been deleted, we keep a list of files 
which have been deleted in “DeletedFiles.log.” This 
is used during the restore to know which files not to 
copy out of the base. 
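
A minimal sketch of how a local incremental might be planned, assuming the previous backup left a record of each file's modification time and size. The function name `plan_incremental` and the dictionary representation are illustrative assumptions; the list of deleted paths is what would be written to DeletedFiles.log.

```python
def plan_incremental(previous, current):
    """previous/current map a relative path to a (mtime, size) pair.

    Returns (changed, deleted): paths to copy into the new zip, and paths
    that existed in the earlier backup but are gone now.
    """
    changed = sorted(p for p, sig in current.items() if previous.get(p) != sig)
    deleted = sorted(p for p in previous if p not in current)
    return changed, deleted
```

Without the deleted list, a file missing from the incremental would be indistinguishable from one that simply did not change.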


For example, consider backing up the file
helloworld.txt. In the remote scenario, this file is copied
to our StorageNet distributed file system. The filename,
file data, file times, and file size are all set in
the StorageNet file system.


File dates and sizes are not stored redundantly in
this case because looking them up later during an
incremental backup costs nothing extra: during a
remote incremental, we are already doing a depth first
traversal of the base backup. File ACL, attribute, and
ownership information is placed into the attributes.ntfs
file for use during restore. The short name is
stored in the directory entry for this file. Although
StorageNet has no 8.3 limitations, it makes provision
for this information to maintain full compatibility with
Windows file systems.


Figure 2: An example file system fragment stored on directory objects on two storage devices (labels recoverable from the original figure: Info Block, Entry data).


Why We Need Quotas


Since all the peers store their data remotely, if
any peer fails it can recover from its remote backup. It
is tempting to spread a given peer’s backup randomly
across all of its peers. Spreading this way gives us
some parallelism when doing backups and should
speed up our backup. However, if we do spread a given
machine’s backup uniformly across its peers, we cannot
tolerate two failures, since the second failure will
certainly lose backup data that cannot be recovered.


Instead of spreading the data across peer
machines, we try to minimize the number of peers
used by a given machine for a backup. Thus, if all
machines have the same size disks, when a second
failure happens there will only be a $1/(n-1)$
chance that backup data is lost for a given machine.
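
The two placement strategies can be compared with a short exact calculation. The sketch assumes a workgroup of n machines, that machine 0 has already failed, and that the second failure is equally likely to hit any of the n-1 survivors; the function name `loss_probability` is hypothetical.

```python
from fractions import Fraction


def loss_probability(n, holders):
    """Chance that a second failure destroys part of machine 0's backup.

    n       -- number of machines in the workgroup (machine 0 already failed)
    holders -- set of surviving peers (numbered 1..n-1) that hold pieces of
               machine 0's backup
    """
    survivors = range(1, n)
    fatal = sum(1 for peer in survivors if peer in holders)
    return Fraction(fatal, n - 1)
```

For six machines, keeping the backup on a single peer gives `loss_probability(6, {1})`, which is 1/5, while spreading it over all five peers gives `loss_probability(6, {1, 2, 3, 4, 5})`, which is 1.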


Unfortunately, we cannot assume that all peers 
have the same sized disks. Thus some peers may store 
the backup data of multiple clients, and other peers 
may use multiple peers to store their backup. If the 
disk sizes are such that a peer’s backup must be stored 
on multiple peers and those peers in turn store backups 
from multiple peers, the backups can easily degenerate 
into a uniform backup across all peers unless some
form of quota is used.


Peer Backup Scenarios 


The number of scenarios supported by
this solution is virtually unlimited. However, some
simple scenarios illustrate the range. The simplest
scenario in the peer-to-peer case is the completely
symmetric homogeneous case, where all stations provide
a hidden partition that is equal in size to their
own data partition, and each station’s data is backed up
to a neighboring station. Figure 3 shows an example
for three workstations.


Figure 3: Three workstation peer-to-peer case. 


In this case every machine backs up its data in 
the hidden partition of its neighbor. 





Figure 4: Non-homogeneous/non-symmetric example.


Figure 4 shows a more complicated scenario. In
this case, the following statements hold true for the
backup group:

• B holds all of A’s data and portions of the data of C
• A holds all of B’s data and portions of the data of C and D
• C holds all of E’s data and portions of the data of D
• D could be a laptop and stores parts of its data on A and C
• E holds all of F’s data
• F (as well as possibly other stations) has available
  target space for a new entrant in the group


Either of these scenarios could have resulted
from:

• autonomic system decisions based on the sizes
  and allocations of a heterogeneous group of
  workstations
• user selection of the backup target


Obviously these two scenarios are not exhaustive. 
Configurations of arbitrary complexity are supported. 
We intend to develop heuristics and user interface meth- 
ods to reduce possible complexity and allow the cus- 
tomer to efficiently manage the backup configuration. 


Dedicated Server Network Backup 


One of the big advantages of using dedicated 
servers as opposed to peers is the availability of ser- 
vice. Because peers are general purpose user 
machines, they may be turned off, rebooted, or discon- 
nected with a higher probability than with dedicated 
servers. In a large enterprise environment, using a
dedicated server approach can guarantee backup availability.
Machine stability is important when trying to do
backups. Dedicated servers are also easier to manage 
because of their fixed function. Machines are also eas- 
ier to update and modify by an admin staff if they 
belong to the IT department rather than users. 


The dedicated server solution uses StorageNet in
much the same way as the peer-to-peer approach. The
dedicated server acts as the target StorageNet device
for the backup clients, and backup data is stored in the
same fashion as in the peer approach. Indeed, the architecture
makes no distinction between dedicated servers
and peers. In this way, the dedicated server solution is
simply a special case of the peer-to-peer usage scenario.


Local Backup 


For mobile users the ability to perform regular 
backups to local media is critical. There are several 
configurations that we must deal with in order to pro- 
vide local backup. The simplest one is for a system 
with one internal hard drive which contains the data 
we wish to back up and one additional hard drive 
where the backup is stored. The hard drive containing 
the backup can be either an external USB/Firewire 
drive or internal hard drive. The user is also allowed to
perform backup locally to the source hard drive. In
this case we use a file system filter driver to protect
the backup files. While this form of backup won’t
protect the user from hard drive failures, it will allow
recovery from viral attacks or software error.


The format of the backup is a simple directory
structure. The main directory is called LifeBoat Local,
and for every machine backed up to the drive we add
a subdirectory. This subdirectory, for example test,
contains multiple directories and files. The most
important file, called the machine file, identifies the
machine that is backed up: it contains the serial number
of the machine and its UUID as returned by DMI [1]. We
use this file during the restore procedure to automatically
detect backups. A sample machine file is shown in Figure 5.


2658N5U AKVAA2W 
OOF7D68B-O0AA0-D611-88F2 -EDDCAE30B833 


Figure 5: Typical machine file. 


The first time a local backup is run, we create a directory
called base and place it in LifeBoat Local\test.
Additional backups are placed in directories called
Incremental 1, Incremental 2, etc. The number of
incremental backups is user configurable, with the
default value set at five. The full directory structure
can be seen in Figure 6. Each of the directories such as
base and Incremental 1 contains the following files:
usage, attributes.ntfs, backup.lst, and some zip files. In
the case of a file system that is less than 4 GB compressed,
a single zip file, backup.zip, suffices. Otherwise,
the Zip64 spanning standard is used.


LifeBoat Local
    test
        base
        Incremental 1
        Incremental 2

Figure 6: Typical directory structure.


The first line of the usage file lists descriptions 
of columns inside the usage file: drive letter, file sys- 
tem type, size of the partition, amount of used space 
and amount of backed up data. In the next line is an 
OS descriptor which is important for post processing 
after restore. Possible descriptors are WinXp, Win200, 
WinNT4.0, WinNT3.5, Win98, Win95 and WinME. 
The lines that follow give information about each partition
in the system. They are used during the restore process.


The attributes.ntfs file is of importance only 
when backing up/restoring NTFS partitions and is not 
used if the partitions are not NTFS. The attributes.ntfs 
file contains all file attributes as well as ACL, SACL, 
OSID and GSID data. We write the data during
backup and restore it during the restore post processing
step. Backup.zip contains the actual backup of all
files. By extending ZIP functionality to use the
current Zip64 specification, we are able to create ZIP
files that are very large, dwarfing the original 2 GB
limit. If the backup is greater than 4 GB zipped, we can
create multiple backup files (backup.zip, backup.001,
etc.) using the Zip64 spanning standard. We chose 4
GB as our spanning limit in order to allow these files
to be read by FAT32 file systems.


For example, let’s imagine we are doing a base
backup and come to the file “helloworld.txt”, which
contains data as well as some ACL information. This
file would be added to the backup.zip file and compressed,
taking care of the filename, file data, modification
date, and file size. The file dates and size are
also placed in a metadata file, backup.lst, to be used
later when creating incremental backups to determine
whether the file has changed and needs to be backed
up again. File ACL, attribute, and ownership information
is placed into the attributes.ntfs file for use
during restore. Finally, the short name for this file, for
example “hellow~1.txt”, is stored in the comments
section of the zip file. Preserving short names across
backup and restore turns out to be very important even
in later Windows versions. Some Windows applications
still expect the short names for files not to
change unless the long filename changes as well.
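
The helloworld.txt walk-through can be approximated with Python's standard zipfile module. This is a sketch under several assumptions: the short name is placed in the per-entry comment field, backup.lst is written as tab-separated text, and the helper name `add_to_backup` is hypothetical. Python's zipfile supports Zip64 but not spanning, so the multi-file case is not covered.

```python
import os
import shutil
import zipfile


def add_to_backup(zip_path, list_path, src_path, arcname, short_name):
    """Append one file to the archive and record its date and size."""
    st = os.stat(src_path)
    info = zipfile.ZipInfo.from_file(src_path, arcname)
    info.compress_type = zipfile.ZIP_DEFLATED
    info.comment = short_name.encode("ascii")  # 8.3 name kept with the entry
    with zipfile.ZipFile(zip_path, "a", allowZip64=True) as zf:
        # Stream the data so large files are not read into memory at once.
        with open(src_path, "rb") as src, zf.open(info, "w") as dst:
            shutil.copyfileobj(src, dst)
    # Assumed backup.lst layout: name, modification time, size.
    with open(list_path, "a", encoding="utf-8") as lst:
        lst.write(f"{arcname}\t{int(st.st_mtime)}\t{st.st_size}\n")
```

On restore, the short name can be read back with `ZipFile.getinfo(name).comment` before the file is recreated.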


A special case of local backup is the “backup to
yourself” case. Here we have only one hard drive
and we want to back up the data to the same drive we
are backing up. In the simple case we have multiple
partitions on the hard drive; for example, we back up
drive C to drive D. In the more complex case, where we
have to deal with a single partition, we back up C to C.
As far as backup is concerned this is not problematic;
however, during restore we have to deal with some
very specific problems related to NTFS partitions and
the lack of write support under Linux.


Rescue and Recovery 


A significant portion of the LifeBoat project 
focuses on client rescue and recovery. This includes 
several UI features for Windows as well as a bootable 
Linux image. The rescue operations allow a user to per- 
form diagnostics and attempt to repair problems. Recov- 
ery enables the user to restore individual files or even 
perform a full restore in the case of massive disk failure. 


Single File Restore 


When the system is bootable, it is possible to 
restore a single file or a group of files from within Win- 
dows [6]. In keeping with the autonomic goal of the 
system, the user interfaces for this system are minimal. 
From Windows, the restore process uses a simple
browser interface to StorageNet using the browser
protocol istp://. A screenshot of the istp browser
interface is shown in Figure 7. We have also written a
namespace extension for StorageNet which behaves like
the ftp namespace extension which ships with Windows. An
example screen looks almost identical to that for ftp:// and
uses the analogous Copy-Paste commands (see Figure 8).







[Figure 7 screenshot: istp:// browser view showing “Index of C/” with entries such as AFPPLGIN/, archives/, AUTOEXEC.BAT, BOOT.BAK, boot.ini, and BOOTSECT.DOS.]


[Figure 8 screenshot: Explorer-style view of the same backup, with entries listed as Vault Folder and Vault File items.]


Figure 8: StorageNet namespace extension. 


Rescue 


The LifeBoat Linux boot CD provides various 
software services that can be used for systems mainte- 
nance, rescue, and recovery. The distribution works in 
almost any PC and can be booted from a number of
devices such as a CD-ROM drive, USB keyfob, local
hard drive, or even over the network. The CD includes
over 101 MB of software including a 2.4.22 kernel,
XFree86 4.1, and full network services for both PCI and
PCMCIA cards and wireless connectivity.


An important part in the design of the bootable 
Linux CD was rescue functionality. We wanted to pro- 
vide the user with at least a rudimentary set of func- 
tions which would enable him to diagnose, report, and 
fix the problem if at all possible. As part of the CD we 
included the following set of rescue functions: 
PCDoctor based diagnostics, which let us run a
comprehensive array of hardware tests; AIM, as a way of
quickly communicating with help available online;
and the Mozilla web browser. We also developed an
application which finds all bookmarks on the local 
drive, in the local backup, and in the remote backup 
and makes them available for use in Mozilla. This pro- 
vides the user with the list of bookmarks that he is 
used to. At the same time we add a selection of book- 
marks which can be custom tailored for a specific 
company to include their own links to local help desk 
sites and other useful resources. 


Even in the face of disaster, an important issue to 
keep in mind is that a damaged hard disk may still con- 
tain some usable data. In the case of a viral attack, boot 
sectors and system files could be compromised but the 
user data could be left intact. Performing a full system 
restore would overwrite any changes made since the 
last backup. For this case we created an application 
that browses through all documents that were recently 
accessed and allows the user to copy them to a safe 
medium such as a USB keyfob or hard drive. 


Recovery 


Full machine recovery is a vital part of any
backup solution. The LifeBoat solution uses its
bootable Linux CD for full machine restore. This is
necessary when the machine cannot be booted to run
the Windows based restore utilities. In order to use the
CD for system recovery we added a Linux virtual file
system (VFS) implementation for StorageNet. Located
on the CD is our Rapid Restore Ultra application,
which is used to restore both local and remote backups.
Rapid Restore Ultra is written in C and uses Qt
for the UI elements [9]. The application comes in two
flavors. The first is intended for a novice user who
has no deep knowledge of systems management issues
and just wishes to restore the data. The second
is intended for knowledgeable system administrators
or advanced users who have deep knowledge of internal
systems functioning. The novice user just restores
the latest backup and the application determines how
the backup is to be restored. Advanced users can
select any backup on the discoverable network or local
devices, and can force the discovery of backups
on non-local networks by entering the IP address or
name of a potential server. The user can then manually
repartition the drives, and assign drive letters and data.
Drive partitions can even be set to different sizes than
were originally backed up. This way the user has full
control of the restore process.


Performing a Full Restore 


The first step towards machine recovery is cre- 
ation of new partitions. To do that we can either use 
the description of partitions from the usage file in the 
last backup or let the user decide the partition sizes 
and types. The usage file is used for both local and 
remote restores and gives all the information about the 
old partition table. In order to determine new partition 
sizes we use a simple algorithm that takes into account 
old partition size, new partition size, the number of 
partitions, and the percentage of usage. 
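
The text does not spell out the sizing algorithm, so the following is only one plausible heuristic built from the same inputs, not the authors' method. It assumes each partition must at least hold its backed up data and that the remaining free space is shared in proportion to the old partition sizes. The name `plan_partitions` and the use of whole megabytes are illustrative.

```python
def plan_partitions(old_sizes, used, new_disk_size):
    """Suggest new partition sizes (all values in MB).

    old_sizes     -- sizes of the partitions recorded in the usage file
    used          -- used space on each of those partitions
    new_disk_size -- capacity of the disk being restored to
    """
    if not old_sizes or len(old_sizes) != len(used):
        raise ValueError("need one usage figure per partition")
    free = new_disk_size - sum(used)
    if free < 0:
        raise ValueError("backed up data does not fit on the new disk")
    total_old = sum(old_sizes)
    # Every partition gets its data plus a share of the free space that is
    # proportional to its old size.
    sizes = [u + free * old // total_old for old, u in zip(old_sizes, used)]
    sizes[-1] += new_disk_size - sum(sizes)  # hand rounding remainder to last
    return sizes
```

For example, `plan_partitions([40, 60], [30, 10], 200)` yields `[94, 106]`: both partitions keep their data and split the 160 MB of free space 40:60.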


Before we write the partition table we have to 
make sure we have a valid Master Boot Record 
(MBR). To be sure we dump our MBR onto the first 
32 sectors of the drive. It is important to keep in mind 
that the MBR that is written at this moment has no 
partition information. If there were any partition info 
at this step in the MBR, we couldn’t be sure that the 
disk geometry we are using is correct. 


After writing the MBR, we write the partition table.
After the partition table is successfully written we
have to format all partitions. One of the issues is the
need to support all of the current Windows file systems
such as FAT, FAT16, FAT32 and, if possible,
NTFS. Linux can format all of the FAT file systems,
but can’t create bootable FAT file systems. In order for
a file system to be able to boot, the master boot record
must point to a valid boot sector. Support for NT,
Win2K and XP is provided through the use of our
application. We pieced together information about
Windows boot sectors and after long debugging found
a way to create valid boot sectors on our own. The
reason we are unable to use the original boot sectors
from a previously backed up machine is simple:
boot sectors are dependent on partition sizes and
geometry, thereby requiring us to create them every
time we repartition. Another reason for not restoring
the boot sector from a backup is that boot sectors are a
favorite hiding place for viruses.


After the disk is formatted and the boot sectors 
are written, we start the client application to restore 
the data. If we are performing remote restore, the 
client connects to the server and upon successful 
authorization the files, including the operating system, 
are copied to the local partition. This process is 
repeated for as many partitions as necessary. After all 
the files are transferred the machine is rebooted and 
available for work. Here is a summary of the steps 
performed in this process: 

• Write general MBR
• Write new partition table
• Format partitions
• Mount boot partition
• Start Sys16 (for FAT16) or Sys32 (for FAT32)
  to create valid boot sector
• Transfer system files
• Copy remaining files
• Unmount partition


If we are performing a local restore there are
multiple issues we have to face. The first problem is
related to having the backup located on the same drive
we are trying to restore to. If this is the case, we are
unable to reformat the partitions and we also can’t
change partition sizes. Another issue is related to
NTFS support in Linux. Let’s say we are backing up to
the C drive and it is formatted NTFS. When the
restore starts it will find the backup on the first partition
and notice that the partition type is NTFS. While
Linux has very good support for reading NTFS file
systems, it has minimal support for writing NTFS. The
solution to this, which is detailed in the next section, is
a technique for formatting an existing NTFS partition
as FAT32 while preserving the backup files.


Once all the preparatory steps are successfully 
completed, we start unzipping data to the desired par- 
tition. If we have only a base backup, the restore 
process ends when unzipping of the base backup.zip 
file is completed. In case of incremental backups the 
restore process is more complicated. Suppose we have 
three incremental backups and the base backup. If we 
wish to restore the third incremental backup, we start 
by unzipping the backup.zip located in the Incremen- 
tal 3 directory. Then we unzip the backup.zip located 
in the Incremental 2 directory and so on. We do this 
until we have finished the backup.zip in the base 
directory. Each time we have to make sure that no files 
get overwritten. 
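
The newest-first restore order can be sketched with the standard zipfile module. The sketch assumes each backup directory holds a single backup.zip and, for incrementals, a DeletedFiles.log with one relative path per line; the function name `restore_chain` and the log's location and format are assumptions.

```python
import os
import zipfile


def restore_chain(backup_dirs_newest_first, target):
    """Unzip backups from newest to oldest without overwriting anything.

    Files listed in a DeletedFiles.log are not copied out of older backups.
    Returns the archive member names that were actually restored.
    """
    deleted = set()
    restored = []
    for backup_dir in backup_dirs_newest_first:
        log = os.path.join(backup_dir, "DeletedFiles.log")
        if os.path.exists(log):
            with open(log, encoding="utf-8") as f:
                deleted.update(line.strip() for line in f if line.strip())
        with zipfile.ZipFile(os.path.join(backup_dir, "backup.zip")) as zf:
            for name in zf.namelist():
                dest = os.path.join(target, name)
                if name in deleted or os.path.exists(dest):
                    continue  # deleted later, or a newer copy is in place
                zf.extract(name, target)
                restored.append(name)
    return restored
```

Because newer archives are unpacked first, the "never overwrite" rule is enough to guarantee that the most recent version of every file wins.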


Once unzip finishes we have to create post processing
scripts that will run immediately following
Windows boot. We have to take care of two problems:
proper assignment of drive letters and NTFS conversion.
In case of a backup with more than three partitions
we can’t be sure that once Windows comes up it
will assign correct drive letters to their respective partitions.
It is also possible that we didn’t use C, D, or E as
drive letters in Windows but, for example, C, G, and V.
While performing backup we add a file called driveletter.sys
to each drive on the hard disk. This file only
contains the drive letter. The first thing we need to do
after restore, when Windows comes up, is change drive
letter names. This is done easily by changing registry
entries to values we read from driveletter.sys and
doesn’t even require a reboot. A second problem is
related to NTFS partitions. When we restore we create
our partitions to be FAT32 and format them accordingly.
Once restore is completed and drive letter
assignment has run its course, we have to convert those
partitions back to NTFS. This is accomplished using
the convert.exe utility that is supplied in Windows.


Upon completing conversion of the drives to 
NTFS we have to set attributes and ACLs for all files 
on that drive. We wrote a simple application that reads
the content of the attributes.ntfs file and sets ACL,
System ACL, Owner SID, and Group SID as well as 
file creation/modification times. This application lets 
us set all file attributes. Upon completion it deletes the 
attributes.ntfs file and exits. That is also the last step 
in post processing. 

Same Partition NTFS Backup 


In order to overcome the lack of write support in 
the Linux NTFS driver we developed a technique 
whereby an NTFS partition can be formatted as FAT32 
while preserving the backup files. This is in essence 
converting an NTFS partition to FAT32. 


The conversion process consists of a number of 
steps. First, a meta file that contains data about the
files to be preserved is created. Next, the set of parameters
for formatting the partition as FAT32 is carefully
determined. The next step is running through all of the 
files to be preserved and relocating on disk only those 
portions that need to be moved in order to survive the 
format. The partition is formatted and the files are res- 
urrected in the newly created FAT32 partition. Finally, 
directories are recreated and the files are renamed and 
moved to their original paths. 


The set of files that need to be preserved must be 
known a priori. In the case of the LifeBoat project, 
this consists of a directory and a small set of poten- 
tially large files. The first step is to create the meta file 
which contains enough information to do a format 
while preserving these files. The meta file may be cre- 
ated immediately after a backup from within Windows 
or, if the NTFS partition is readable, it is created in a 
RAM disk from within the Linux restore environment. 






Figure 9: The top partition shows the first four clus- 
ters of an NTFS partition, each with two sectors 
per cluster. Below is a FAT32 partition with a FAT 
size of three sectors followed by the first three 
data clusters. This illustrates how a FAT32 parti- 
tion with the same cluster size can be created yet 
the data is no longer cluster aligned. 


In the case of Windows, the file locations are
available through standard APIs, and the meta file
contains itself as the first entry. In the case of Linux,
the NTFS driver does not provide a way to find out the
clusters of a file. An ioctl was added to the driver for
this purpose. A typical meta file is well under 8K in
size, so excessive memory use is not a concern.


Creating a meta file is not the only preparation
required for formatting the NTFS partition as FAT32. 
The data files all reside on cluster boundaries. Unfor- 
tunately, NTFS numbers its clusters starting with zero 
at the first sector of the drive, while FAT32 begins its 
clusters at the sector immediately after the file alloca- 
tion tables. Formatting with the same cluster size does 
not necessarily mean that the clusters will be aligned 
properly (see Figure 9). 


A solution to the cluster alignment problem 
would be to always format the FAT32 partition with a 
cluster size of 512 bytes (one sector) and to downsize
the extent data by splitting it into 512-byte
clusters. In practice this leads to an extremely large file
allocation table when partitions run into the gigabytes. 


The cluster size of the FAT32 file system is 
determined by constraining the size of the resulting 
file allocation table to be a configurable maximum 
size (default 32 MB). The simplest way to determine 
this is to loop over an increasing number of sectors per 
cluster in valid increments until the resulting FAT size
no longer exceeds the maximum. In order to
align the clusters, we manipulate the number of 
reserved sectors until the newly created FAT32 parti- 
tion and the former NTFS partition are cluster aligned. 
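
The two calculations above, picking a cluster size that keeps the FAT under the cap and then padding the reserved sectors until the data area is aligned, can be sketched as below. This is a simplified illustration, not the authors' code: the FAT size is an upper bound computed from the whole partition, a single FAT is assumed, and the names `choose_layout`, `fat_sectors`, the starting value of 32 reserved sectors, and the 128-sector cluster limit are assumptions for the example.

```python
SECTOR_BYTES = 512
MAX_SECTORS_PER_CLUSTER = 128  # assumed upper limit (64 KB clusters)


def fat_sectors(total_sectors, sectors_per_cluster):
    """Upper bound on FAT size: one 4-byte entry per cluster plus two reserved."""
    clusters = total_sectors // sectors_per_cluster
    fat_bytes = (clusters + 2) * 4
    return -(-fat_bytes // SECTOR_BYTES)  # ceiling division


def choose_layout(total_sectors, ntfs_spc,
                  max_fat_bytes=32 * 1024 * 1024, min_reserved=32):
    """Return (sectors_per_cluster, fat_sectors, reserved_sectors)."""
    # Start from the NTFS cluster size and double until the FAT fits the cap.
    spc = ntfs_spc
    while fat_sectors(total_sectors, spc) * SECTOR_BYTES > max_fat_bytes:
        spc *= 2
        if spc > MAX_SECTORS_PER_CLUSTER:
            raise ValueError("FAT does not fit within the configured maximum")
    fat = fat_sectors(total_sectors, spc)
    # Pad the reserved area so the data area starts on a cluster boundary
    # measured from sector zero, which is where NTFS starts counting.
    reserved = min_reserved
    while (reserved + fat) % spc != 0:
        reserved += 1
    return spc, fat, reserved
```

Because the new cluster size is a power-of-two multiple of the NTFS cluster size, aligning the start of the data area to the new cluster size also aligns it to the old NTFS cluster boundaries.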


At this point the layout of the FAT32 file system 
and the potentially larger cluster size are determined.
Before formatting can occur, the extents of all the data 
files must be preprocessed to relocate any extent that 
is either located before the start of the FAT32 data area 
or does not start on a cluster boundary. In the best 
case, the cluster size has not changed, so only the first 
set of relocations must occur. Otherwise relocating an 
extent requires allocating free space on the disk at a 
cluster boundary and possibly stealing from the file’s 
next extent if its length is not an integral number of 
clusters. Moving an extent’s data is time consuming so 
it is avoided whenever possible. Free space on the disk 
is found using a sliding bitmap approach. Any cluster 
that is not in use by an entry in the meta file is consid- 
ered free. A bitmap is used to mark which clusters are 
free and which are in use. The relocation process 
requires that enough free space is available to success- 
fully relocate necessary portions of the files to be pre- 
served. When restoring to the same partition this will 
always be the case. 
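
The relocation test and the free-space search described above can be sketched as follows. This is an illustration under stated assumptions rather than the LifeBoat implementation: extents are given as (start sector, length in sectors) pairs relative to the start of the partition, the bitmap is a plain Python list rather than the sliding variant the text mentions, and all function names are hypothetical.

```python
def needs_relocation(start, data_start, spc):
    """An extent must move if it lies in front of the FAT32 data area
    or does not begin on a FAT32 cluster boundary."""
    return start < data_start or (start - data_start) % spc != 0


def build_bitmap(extents, data_start, spc, total_sectors):
    """Mark every FAT32 cluster touched by a preserved extent as in use.

    Any cluster not used by an entry in the meta file counts as free.
    """
    n_clusters = (total_sectors - data_start) // spc
    used = [False] * n_clusters
    for start, length in extents:
        end = start + length
        if end <= data_start:
            continue  # entirely in front of the data area
        first = (max(start, data_start) - data_start) // spc
        last = min((end - 1 - data_start) // spc, n_clusters - 1)
        for cluster in range(first, last + 1):
            used[cluster] = True
    return used


def find_free_run(used, count):
    """Return the index of the first run of `count` free clusters, or None."""
    run = 0
    for i, in_use in enumerate(used):
        run = 0 if in_use else run + 1
        if run == count:
            return i - count + 1
    return None
```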


Formatting is the simplest step. The mkdosfs 
program performs a semi-destructive format in that it 
only overwrites the reserved and file allocation table 
sectors. The ‘-f’ switch is used to limit the number of 
file allocation tables to one. 
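
As a rough illustration, the format step could be driven from a script like the one below. Only the ‘-f’ switch is named in the text; passing the cluster size with `-s` and the reserved sector count with `-R`, and selecting FAT32 with `-F 32`, uses standard mkdosfs options but is an assumption about how the authors invoked it. The helper name and the device path in the usage note are hypothetical.

```python
def mkdosfs_argv(device, sectors_per_cluster, reserved_sectors):
    """Build an mkdosfs command line for a single-FAT FAT32 format."""
    return [
        "mkdosfs",
        "-F", "32",                      # FAT32
        "-f", "1",                       # one file allocation table
        "-s", str(sectors_per_cluster),  # sectors per cluster
        "-R", str(reserved_sectors),     # reserved sectors, used for alignment
        device,
    ]
```

The resulting list can be handed to `subprocess.run(mkdosfs_argv("/dev/hda1", 8, 39), check=True)` once the layout has been computed.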


Once the file system is formatted as FAT32, 
entries for the files to be preserved must be created. 
This is done via a user space FAT32 library written for 
this purpose. The user space library can mount a FAT32 
partition and create directory entries in the root directory.
It uses the data from the meta file to resurrect each
meta file entry by creating a directory entry and writing
the extents to the file allocation table.


Once all of the files have been resurrected, it is
safe to use the Linux FAT32 driver to write to the par- 
tition. The meta file is traversed once again to create 
the full paths and rename all of the files to their proper 
names. Finally, resident files are extracted from the 
meta file and written. At this point the partition has 
been converted from NTFS to FAT32 while preserving 
all of the files necessary to perform a restore. 
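
Writing a file's extents into the file allocation table, as described for the resurrection step above, amounts to linking each cluster to the next and ending the chain with an end-of-chain marker. The sketch below shows this on an in-memory table. It is not the authors' user space library; `chain_extents` and the dictionary standing in for the on-disk FAT are assumptions, and extents are (first cluster, cluster count) pairs.

```python
FAT32_EOC = 0x0FFFFFFF  # FAT32 end-of-chain marker


def chain_extents(fat, extents):
    """Link the clusters of all extents into one FAT chain.

    Returns the first cluster of the file, which is the value a
    directory entry would record, or 0 for an empty file.
    """
    clusters = [c for start, count in extents
                for c in range(start, start + count)]
    for current, following in zip(clusters, clusters[1:]):
        fat[current] = following
    if clusters:
        fat[clusters[-1]] = FAT32_EOC
    return clusters[0] if clusters else 0
```

Because only table entries are written, the file data itself never moves during this step, which is why the preceding relocation pass has to leave every extent on a cluster boundary.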


Centralized Management 


In the case of multiple work groups, management 
issues become highly important. If a system admin- 
istrator is supposed to deal with multiple groups with 
ten or more PCs he will need some sort of an auto- 
nomic system to simplify the management of storage. 


We based our system on IBMDirector which is 
widely available and boasts a high acceptance rate 
throughout the industry. To enable IBMDirector for
our purposes we extend it in several ways. We devel- 
oped extensions for the server, console and clients. 
Below we quickly detail the nature of those extensions. 


Client side extensions are written in C++. The 
extensions provide all backup/restore functions as 
described elsewhere in this document. An important 
extension is related to communication between the 
client and server. The communication module relays 
all the requests and results between the two machines. 
The client also starts a simple web server which upon 
authorization provides information about the given 
client. This feature was implemented for the case 
where no IBMDirector server is available or when the 
server is not functioning properly. The information 
exported on the web page is the same as what can be 
obtained through the IBMDirector console. The infor- 
mation exported is shown below: 

• workgroup name

• back-up targets

• date of last successful backup

• contact info

• Number of drives

• Size of drives

• Free space on each drive

• File system on each drive

• OS used

• Current status (performing back up, restoring,
idle)

• User name and user info

• Location of the backup


Server and console side extensions are written in 
C++ and Java. They are rather simple since all we
need to add on the server are basic GUI elements that 
allow us to interface with the client and to receive data 
sent from the clients. The most complex extension is 
related to extending associations so that all StorageNet 
devices in the same work group appear in treelike
form. The goal of this part of the project is to make a
system that will be usable with or without the IBMDirector
server.


2004 LISA XVIII - November 14-19, 2004 - Atlanta, GA


LifeBoat - An Autonomic Backup and Restore Solution


The Corporate Environment 


Our primary target environment in developing 
this system is a workgroup satellite office. If this is 
used in a corporate environment, there is the need for 
administrator level handling for setup, control and 
migration. Similar to the workgroup setting, the 
requirements of the workstation user are limited to: 

• knowing my data is backed up (having confidence)

• knowing that my data is backed up to an area
that will facilitate easy restoration


In contrast, the administrator in the corporate 
environment has requirements for additional control 
and data, including: 

• wants different users’ data to be distributed
evenly (or specifically) across several servers

• wants reports specifying where a user’s data is
backed up and the usage per server

• during initial rollout, wants a way to seed the
backup server destination to achieve the first
goal

• during server migration, needs a way for the
user’s data to go to another server.

The general processing flow is described below. 


The asset collection process on the user’s 
machine sends the UUID (machine serial number) to 
an administrative web server. A long running process 
on the server discovers available backup targets. The 
administrator reviews a web page containing unas- 
signed backup clients and discovered servers, and 
assigns these clients to a server. This information is 
recorded and used by the client backup process (usu- 
ally scheduled) to keep the user’s data. This assign- 
ment information is also used by the file and image 
restoration processes. Described graphically, we have: 





Figure 10: General flow with dashed lines representing 
meta-data and solid lines representing backup data.


The detailed processing flow is:


1. The asset collection process extracts the machine
serial number, type, and location and uses HTTP
POST to save this in a web server. If the asset
collection process is disabled for this client, the
user can surf to a well-known web page where
the same executable from the asset collection
process can be downloaded and run.


2. In parallel to this process, a long-running
process resident on the server is busy discovering
backup targets. These targets are StorageNet
servers. The discovery protocol is limited
to the subnet where discovery is issued.
Because of this, there is a web service located
at a well-known address in each subnet that is
used by the server-resident process to discover
servers in other subnets. The list of available
backup targets is maintained and updated in the
administrative server.


3. The administrator surfs to a web page containing
a list of unassigned clients and available
servers. The processing behind this page automatically
pre-selects target servers correlating
to clients within their respective subnets. For
those not pre-selected, or in which an override
is requested, the administrator picks a server
and one or more clients to back up to the
selected server. This causes the machine file
mentioned previously to be stored in that
backup server. This is used for discovery by the
restoration process.


4. The backup process on the client machine will
normally be invoked as a result of a scheduled
alarm “popping.” When this occurs, the
backup process will check for a machine file
(containing its UUID) on all the servers on its
subnet. If it finds this, it initiates the backup to
that target.


5. If it does not find it, the backup process looks
on the administrative server to determine which
target it should back up to. If no assigned target
is found, an error is generated; otherwise the
backup process spools the user’s data out to the
assigned target.


6. When a file-based restore is requested, a process
starts going through similar processing to the
backup client to locate the user’s data. Then a
network share using the StorageNet Windows
file system driver (FSD) is created pointing to
the target backup server. This FSD allows the
use of normal Windows-resident tools to access
the backup data as described above.







Backup and Restore Times (seconds)

Columns: Local HD | Local USB1.1 | Local USB2.0 | Remote 100 MB

Restore 2.3 GB: 1100s
Backup 2.3 GB NTFS: 4254s [remaining cells illegible]
4440s 1001s 1200s


Bonkenburg, et al. 


7. When an image-based restore is requested, a 
process starts going through similar processing 
to the backup client to locate the user’s data. 
Then a network share using the StorageNet 
Linux file system driver is created pointing to 
the target backup server. This file system driver 
allows the use of normal Linux-resident tools to 
access the backup data as described above. 
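
The target lookup that the backup process performs in steps 4 and 5 can be sketched as follows. This is a minimal illustration, not the LifeBoat client code: `resolve_backup_target` and its two callback parameters are hypothetical names standing in for the machine-file check on a StorageNet server and the query to the administrative server.

```python
def resolve_backup_target(uuid, subnet_servers, has_machine_file, admin_lookup):
    """Return the server this client should back up to.

    First look for a machine file carrying our UUID on each server in the
    local subnet; if none has it, ask the administrative server.
    """
    for server in subnet_servers:
        if has_machine_file(server, uuid):
            return server
    target = admin_lookup(uuid)
    if target is None:
        raise LookupError(f"no backup target assigned for {uuid}")
    return target
```

The restore processes in steps 6 and 7 are described as going through similar processing, so the same lookup could serve both directions.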

Image of the FSD accessing a StorageNet


Performance 


During the extensive testing we gathered several 
interesting numbers that reflect the speed and effi- 
ciency of the backup and restore process [5]. Figure 
12 shows the time in seconds for backing up and 
restoring 2.3 GB of data for a number of different tar- 
get locations [7]. The restore process is measured from 
clicking on the restore button to the finish (reboot of a 
machine). Our main test machine is a ThinkPad R32 
with 256 MB RAM and an IBM 20 GB hard drive.


A separate series of tests was performed using a
1.6 GHz Pentium M IBM ThinkPad T-40. A 1.7 GB
image requires three minutes (156 sec) to back up. The
restore from local HDD requires 15.5 minutes from 
selecting the restore button of which ten minutes is file 
system preparation and data transfer and seven min- 
utes is rebooting and converting. 






Figure 12: Backup and restore times in seconds.


The final stage requires two minutes to complete
the attribute restore into the NTFS file system. This 
makes the total time 17.5 minutes. We have tested this
at least 20 times on four systems with minimal
variation. The main variations seem to be related to
the file system preparation step which takes a minute or 
two longer after a base backup is re-established. 


When compared to Xpoint’s software:

1. Backup is approximately 3X faster.

2. Compression is typically 2X better.

3. Our version works without dominating the PC 
while Xpoint’s version of RRPC does not. 

4. The restore performance for a base only is simi- 
lar. 

5. Restore of an incremental plus base is dramati- 
cally improved in ours since it is essentially the 
same as a base only while Xpoint’s takes about 
twice as long. 


Conclusion 


LifeBoat provides a way to back up a system such
that the backup files are accessible for single file
restore as well as a full image restore. Our work also
shows how Linux can be effectively used to restore a
Windows(tm) system while also providing a rescue
environment in which a customer can salvage recent
files and perform basic diagnostics and productivity
work. Most importantly this system allows for a 
machine to be completely restored from scratch when 
the boot disk is rendered unbootable. 


The local backup version of this work shipped as 
part of IBM’s Think 


In this paper we presented a description of the 
latest research project in autonomic computing at IBM 
Almaden Research Center. We described a fully auto- 
nomic system for workgroup based workstation 
backup and recovery with options for both everyday 
restore of a limited number of files and directories as 
well as full catastrophe recovery. 


This project is work in progress and is funded 
partially by the IBM Personal Systems Institute.


PatchMaker: A Physical
Network Patch Manager Tool 


Joseph R. Crouthamel, James M. Roberts, Christopher M. Sanchez, and 
Christopher J. Tengi — Princeton University 


ABSTRACT 


PatchMaker is a network management tool that tracks and stores descriptions of physical 
network patches in a database. Physical patches are cables that connect switches, patch panels, and 
room ports. PatchMaker allows administrators to manage switch ports from multiple vendors 
giving them the ability to enable/disable ports, make VLAN changes, and perform other switch 
port functions through a common web front-end. Finally, PatchMaker aids in host management by 
tracing network connections, identifying what kinds of devices are connected to the network, and
helping to troubleshoot physical network problems.


Introduction 


Over the past year, the network/systems adminis- 
trative group in the Department of Computer Science 
at Princeton University has been putting together a 
plan (along with a good bit of code) to ease the lives of 
both the end-users and administrative staff. The many 
goals laid out by the team revolve not only around 
ease-of-use and happy end-users, but also around 
issues such as security, resource accounting (and, 
indeed, accountability), and administrative overhead. 


PatchMaker was the first application of a suite of 
software. Our goal in developing PatchMaker was to 
integrate all the tools needed to deal with end-user 
requests related to the network into a single web GUI 
front-end. We wanted a system that would track physi- 
cal patches, manage switch ports, track hosts, and be 
extensible. Since the cost of buying enough switch 
ports to connect all available ports on every wall box 
would exceed our budget, we needed a simple way to 
track patches from the wall boxes all the way back to 
switch ports. 


In the past, whenever a user changed locations 
we would be required to move the original patch phys- 
ically to another patch panel and, at the same time, to 
update the cabling documentation. We wanted to 
improve our turn-around time and respond faster to 
user requests. PatchMaker allows an administrator to 
locate a host quickly, find out what switch port it is 
located on, and move the cable to its new location 
without needing to trace the connection by hand. 


The Department of Computer Science at Princeton
University has 1600+ CAT 5e RJ-45 ports located
throughout the building. Each room contains multimode
and single-mode fiber optic ports along with
RG6 for CATV. We are using two Foundry FastIron
1500 switches and an Alcatel OmniCore 5052 switch
to connect the bulk of the user ports throughout the
building. While most of the network connections
“home run” back to a central machine room, we are
starting to maintain our networks in other neighboring 
buildings, as we have run out of space in the original 
five-floor building. 


Keeping track of all the connections from the 
room jacks, patch panels, and switches is very impor- 
tant, since users and hardware are moving all the time. 
Network administrators are constantly dealing with 
end-user moves, adds, and changes. Having out-of- 
date cabling documentation was always a problem and 
it increased the time required for an administrator to 
deal with end-user requests. 


PatchMaker was built out of a need for a more 
automated way of resolving problems at the physical 
network layer. Prior to PatchMaker, system adminis- 
trators manually entered changes reflecting patch addi- 
tions, changes, and deletions into a database. The sys- 
tem was often out-of-date. As a result, administrators 
would eventually abandon the database altogether, 
forcing us to perform audits twice a year to re-inven- 
tory the location of all physical patches. In developing 
a patch database system that was easy to keep updated, 
and by adding tools to administer switches and to trace 
switch connections, we have found that our adminis- 
trators are now using the system more effectively. 


Motivation For the Project 


Documentation for physical network cabling is 
essential in managing and troubleshooting a network. 
Documentation done after the fact is often wrong or 
out-of-date. In our previous system we would do a full 
building audit of the cabling and hardware connected 
to the network twice a year. We then would try to keep 
the data current using a locally written Perl script.
Audits were very time consuming and often data was
out-of-date before the audit was even completed.


Once the information was stored in a database, 
administrators did not keep the patch information 
updated. The system was used to track patches, but
administrators had little incentive to use the system,
because in most cases it was faster to trace a patch by 
hand than it was to use the database. Over the course 
of several years we attempted to institute procedures 
with the aim of encouraging administrators to main- 
tain the system. We still failed to gain total compli- 
ance, as administrators typically respond to emergen- 
cies and to frequent experimentation by creating ad- 
hoc patches that never get documented. 


As the department grows and new technology
becomes available, systems administration tasks also
grow, so we needed to identify areas of our work that
could be automated, freeing up time to focus on other 
tasks. The PatchMaker system was developed to solve 
the problems we had with the previous system and to 
add functionality that would make the system more 
useful for other tasks related to network administration.
The previous system’s principal failing was that
the data was not up-to-date. PatchMaker addresses this 
problem by making data updates fast and easy, while 
removing the previous ineffective system. The previ- 
ous system only stored physical patches. PatchMaker 
adds more tools and bundles the functionality needed 
to complete the entire process in a single system, thus 
making it more useful. 


We could not find existing tools that incorpo- 
rated all the things we wanted. One open source tool, 
LANdb, did not fit our needs, as it only maintained a 
database of patches, and did not do switch manage- 
ment. The few commercial products that do exist are 
focused on switch management and tend to be very 
expensive. Other people we spoke with used a simple 
database, but they experienced similar problems keep- 
ing the data current. 


While intelligent patch panels that totally auto- 
mate the record keeping process exist, this was not a 
viable solution for us since we already had an infra- 
structure in place. It would have been expensive and 
time consuming to swap out patch panels. Intelligent 
patch panel systems usually log to a proprietary data- 
base. Proprietary databases are often inflexible, and 
using them makes it difficult to retrieve data for other 
purposes. Also, at the time we were considering them, 
intelligent patch panel manufacturers did not sell pan- 
els rated for Category 6 cable. This would have been a 
problem for us in the future. 


What It Can Do 


PatchMaker is a visual web GUI front-end to our 
network. Incorporating three main tools needed to 
make network changes, we created a tool that administrators
find useful and that allows us to keep our cabling
documentation up-to-date. The first part of the system, 
the network cabling documentation database, keeps 
track of where all the cables are located end-to-end in 
the building. The next part of the system is a switch 
management interface that allows an administrator to 
make changes on switch ports from multiple switch
vendors. The last part of the system is host and device
tracking and management. 


PatchMaker allows an administrator to search for 
a host or device in multiple ways to find its location in 
the building. Using the built-in search tools, adminis- 
trators can locate hosts by MAC addresses, hostnames, 
or physical locations such as room numbers or patch 
panels. Using a web interface an administrator can 
trace a patch, receive information about what is con- 
nected to a particular port, and make changes to 
whichever switch port the host is using. 


When debugging a network problem, there are 
occasions when one suspects that the physical cabling 
or switch ports are not functioning properly. To help 
with this, PatchMaker will map the user’s wall box 
location straight to the switch port and perform diag- 
nostics remotely. PatchMaker is able to do this as the 
patch panel database contains mappings that go from 
the point the user plugs into the wall through interme- 
diate patch panels to the end-point represented by a 
port on a switch. A user who is not connected is not 
generally happy. While simple solutions can some- 
times be identified by the clients themselves, there are 
other times when cabling, or even a switch port can be 
faulty. In these cases, having the ability to map the 
user’s wall box location straight to the switch port, and 
then remotely to perform diagnostics, is invaluable. 
PatchMaker does rely on a host being connected in 
order to locate it and gather information about the 
host. If a host moves to a port that is not physically 
wired to a switch port, there is no way of obtaining 
host information or doing configuration on the port. In 
this case, we would create a patch by hand. 
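
The end-to-end mapping described above, from wall box through intermediate patch panels to a switch port, can be pictured as a chain of patch records that is followed one hop at a time. The sketch below is an illustration only; PatchMaker itself is written in PHP against a database, and the dictionary layout, the port labels, and the function name `trace_patch` are assumptions.

```python
def trace_patch(patches, start):
    """Follow patch records from `start` until no further hop is recorded.

    `patches` maps each endpoint label to the endpoint it is patched to.
    Returns the full path, beginning with `start`.
    """
    path = [start]
    seen = {start}
    while path[-1] in patches:
        nxt = patches[path[-1]]
        if nxt in seen:
            raise ValueError(f"patch loop detected at {nxt}")
        path.append(nxt)
        seen.add(nxt)
    return path
```

If the last element of the returned path is not a switch port, the host is plugged into a wall box that is not yet wired through, which is the case the text says must be patched by hand.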


In the course of normal network management, it 
is sometimes necessary to make small changes or to 
monitor ports: VLAN membership, link status, 
resource allocation (protocol, TCP/UDP usage, IP 
source and destination, bandwidth utilization). Patch- 
Maker has the ability to configure ports from the 
switch-side and allows an administrator to enable/dis- 
able ports, change a port VLAN, or check the link sta- 
tus and speed of a port. This allows the administrator 
to quickly locate problems, identify hosts that may 
have a security problem, and disable the port to which 
a host is connected. 
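
As an example of what such a port operation involves, disabling a port over SNMP usually means setting the standard IF-MIB object ifAdminStatus (1.3.6.1.2.1.2.2.1.7) to down(2) for the port's interface index. The text does not say which objects PatchMaker sets, so the sketch below is an assumption: it builds a Net-SNMP `snmpset` command line, and the helper name, host name, and community string are hypothetical. VLAN changes are typically vendor specific and are not shown.

```python
IF_ADMIN_STATUS = "1.3.6.1.2.1.2.2.1.7"  # IF-MIB ifAdminStatus
UP, DOWN = 1, 2


def set_port_argv(host, community, if_index, enabled):
    """Build an snmpset command that enables or disables one switch port."""
    value = UP if enabled else DOWN
    return [
        "snmpset", "-v", "2c", "-c", community, host,
        f"{IF_ADMIN_STATUS}.{if_index}", "i", str(value),
    ]
```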


PatchMaker has a port-monitoring and alerts
interface with links to graphs showing MRTG and 
Sflow (RFC 3176) data for the particular port. Using 
PatchMaker, we can access multiple tools from a sin- 
gle interface. For example, using SNMP and Sflow, 
we graph port traffic, maintain a log of traffic protocol 
data, detect sources of congestion, and manage quality 
of service on the switches. 


Implementation 


PatchMaker was developed using open source 
tools. It is written in PHP and DHTML. It uses PHP’s 
database abstraction layer, courtesy of PEAR, and
allows for a variety of database types to be used on the
backend. The database stores information about 
patches, buildings, racks, and rooms. It contains map- 
pings that stretch from the point where the user plugs 
into the wall, through any intermediate patch-panels, 
to an end point represented by a port on a device. 
Switch information is stored in the database and 
includes attributes such as port count, device name, 
SNMP community strings, and switch slots. Image 
information for devices, patch panels, port pictures, 
and coordinates are stored in the database. An image 
map for a new patch panel or switch can be added and 
integrated into the new system in a matter of minutes. 


The application runs very fast, querying the 
device, mostly via SNMP. The returned information is 
saved temporarily so other sections or panels with 
ports that patch to the same device can avoid re-query- 
ing the same information. It is written to work with 
various switches from different vendors, and plug-ins 
can be easily written for nonstandard devices. 
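
The temporary saving of query results described above is a form of memoization keyed by device. A minimal sketch, assuming a query function that takes a device name and returns its port data, could look like this; the class name `DeviceCache` and the shape of the returned data are hypothetical and do not reflect PatchMaker's PHP code.

```python
class DeviceCache:
    """Remember the result of one device query for the rest of a page view."""

    def __init__(self, query):
        self._query = query   # function that actually contacts the device
        self._results = {}

    def get(self, device):
        # Only contact the device the first time it is asked for.
        if device not in self._results:
            self._results[device] = self._query(device)
        return self._results[device]
```

Several patch panels that lead to the same switch then share one round of SNMP requests instead of repeating it per panel.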


The application displays a graphical representa- 
tion of the patch-panels and devices to a web browser. 
The GUI lets the user search by building, rack, room, 
Ethernet address, hostname, etc. The user can select 
multiple patches and devices to view at the same time. 
At a quick glance, the user can determine patch and link
status by the color of the port image. Statistical information
about each port, such as patch point, port
speed, and VLAN, can be viewed in the side-panel
when the cursor is placed over it. Clicking on a port
gives the user a context-style menu for enabling and
disabling, changing VLANs, adding and deleting 
patches, changing rooms, as well as other commands. 


PatchMaker incorporates links to other network 
monitoring tools which use Sflow (RFC 3176) and 
NetFlow to monitor our Foundry and Cisco switches, 
respectively, and gather port statistics. We also use 
SNMP to monitor traffic load on the switches and 
include links to MRTG (Multi Router Traffic Grapher) 
to display graphical images of the network traffic. 


Even with some automation efforts, it is still nec- 
essary to enter patch information into the database. We 
have, nevertheless, built-in checks that watch for 
patches that have not been properly entered in the 
database. Figure 1 shows a device view in Patch- 
Maker. In the figure, unidentified ports are marked 
with a black ‘X’ across the port. This allows an 
administrator to quickly identify ports that have 
patches, but are missing an end-point record. The sys- 
tem allows an administrator to know if a patch they 
are making is already defined in the database, and then 
it prompts them to update the database without having 
to remove the old patch. Figure 3 shows all the options 
that can be performed on a port. PatchMaker allows an 
administrator to perform switch port tasks like 
enabling or disabling ports and changing the VLAN or port speed, no
matter what view you are in. Most network management
tools give an administrator a switch view and
allow for management of the switch from only this
view. In PatchMaker, if you view only the patch panels
you can still change the switch port to which the patch
panels connect. Figure 4 shows a building room
search and displays both the switch and patch panel 
information with an option to select a patch to view 
more information such as connected hosts. Figure 5 
shows how you can display multiple devices and patch 
panels at the same time. 


PatchMaker can handle more complex cabling 
infrastructures, including network closets and multiple
cross connects, but Figure 2 illustrates a basic cabling
diagram. In this diagram, a switch port is connected to 
a patch panel, then to a wall box, and finally to a host 
or other device. On the one hand, it is relatively sim- 
ple to discover what host or device is on a particular 
switch port using switch management tools. On the 
other hand, information about which patch panel is 
used, and what wall box connections exist between the 
switch and the host, can only be discovered by hand 
tracing the connection or by keeping the patch docu- 
mentation current. As long as patch panels and wall 
boxes are passive devices and do not signal their presence,
this limitation is inherent.

Figure 2: Basic network cabling diagram.


Future Enhancements


While PatchMaker is very useful to us, there are 
a few features we would like to add over time. In 
order for us to release this as a package so others can 
use it, we plan to write a GUI creation tool that would 
be used to set up patch panels and switch images in a 
drag and drop style with a small set of available 
images. We also would like to enhance user authenti- 
cation to allow finer-grained control of user privilege 
levels. Other features we are working on include 
switch configuration file management, quality of ser- 
vice or rate limiting management, security access con- 
trol list management, and change logging on the sys- 
tem and user levels. 


Conclusions 


The system was built out of a need for a more 
automated way of resolving problems at the physical 
network layer. Prior to PatchMaker, system adminis- 
trators manually entered changes reflecting patch 
additions, changes, and deletions into a database. The 
system was always out of date. As a result, administrators
would eventually abandon the database altogether,
forcing us to perform audits twice a year to re-inventory
the location of all physical patches. By
developing a patch database system that is easy to
keep updated, and by adding tools to administer
switches and to trace switch connections, we found
that our administrators actually use the system.


Since the system went into production, we have 
reduced user downtime and also reduced the time 
spent by administrators in dealing with network- 
related problems. We also decreased our response time 
in dealing with hosts that become compromised by a 
virus or trojan. We are able to disable the port and to 
take corrective actions on the hosts before the situa- 
tion worsens. The search functionality also allows us 
to track hardware as it moves from room to room 
around the department. Our goal in creating the sys- 
tem was to build a tool that others could use and 
quickly deploy in their own environments. 


Availability 


PatchMaker will be released under the GNU General Public
License (GPL) and will be made available at:
http://www.cs.princeton.edu/patchmaker. To request more
information about PatchMaker, please send email to
patchmaker@lists.cs.princeton.edu. We expect to release a
public version of PatchMaker in November 2004.


ABSTRACT


Periodic data backup is a system administration requirement that has changed as wireless 
machines have altered the fundamental structure of networks. These changes necessitate a 
complete rethinking of modern network backup strategies. The approaches of the 1980s and
1990s are no longer sufficient and must be updated. In addition to standard backup programs from
vendors, specialized system administration tools are often needed. This paper examines one 
backup system and the major software components used to implement it. NCSA has developed a 
Backup Tracking System (BTS)¹ to perform backup operations based on knowledge of the
network and when each machine was last successfully backed up. BTS can chronologically list all 
computers: from those currently attached to the network through those that have ever been 
attached over the life of the BTS program. BTS also provides information about all backup 
operations including the time of last attempt, success state, amount backed up, etc. The BTS 
database also contains the date of the last successful backup for each machine and whether it has at
least one VIP user (to be given preferred status during backups) or only non-VIP users.


Introduction 


Modern networks of end-user machines are 
becoming increasingly dynamic and heterogeneous. 
Operating systems come in various versions of Unix, 
Windows, or MacOS. Mobile hosts, which may only 
be available on the network rarely or on an intermit- 
tent basis, have become almost as common as desktop 
workstations. The data on individual hosts can be criti- 
cal to the success of an organization (for cautionary 
stories of those who have been victims of data loss 
without backup see [19]). 


A wide variety of backup and data integrity tech- 
niques exist, and they vary in cost, features, and effec- 
tiveness. Mirroring SAN systems are at the very high 
end. Such systems can provide real-time on-site and
off-site mirroring and versioning of data as it is modified,
and can allow quick recovery from both common and 
catastrophic failures. At the low end, users can indi- 
vidually manage backups to removable media, such as 
external hard drives, CD-R, or Zip disks. 


While the techniques available to protect data 
have increased greatly, the management of such pro- 
tection has not. Systems such as Amanda [14] expect 
collections of always-on, always-connected Unix 
workstations. Later commercial products like IBM’s 
Tivoli Enterprise Management Suite [9] and Legato
Systems’ NetWorker [11] have focused on extending
support for the backend archive technology and new 
host operating systems. 


¹Funded in part by a grant from the Office of Naval Research
(ONR) under the auspices of the Technology Research,
Education, and Commercialization Center (TRECC)
established at NCSA/University of Illinois.


Unlike previous systems, BTS is aware of the
disconnected nature of modern networks and manages 
backup based on system and user priority. For exam- 
ple, a missed backup in most backup systems at best 
causes a host to be bumped in priority for the next 
scheduled backup. In contrast, BTS backups are on- 
demand, and if a host goes beyond the acceptable win- 
dow without being backed up, a system administrator 
will be alerted to investigate the reason for failure. 


Motivation 


The differences in characteristics between older 
and more modern networks of end-user hosts necessi- 
tate revisiting the motivations and goals of data man- 
agement. Modern networks are dynamic, and it is 
beyond the capability of current systems to cope with 
increasingly disconnected machines and backup latency. 


The Reasons for Backups 


There are many reasons why data backup is a 
crucial requirement for virtually every organization. 
The well-known, traditional reasons still hold. Disasters
such as floods and fires strike networks. Users
inadvertently delete files and overwrite existing files. 
Hackers or disgruntled employees do the same inten- 
tionally. Disk drives, inherently fragile mechanical 
devices, fail, and lose all of the data they hold. Addi- 
tionally, files become corrupted by bad disk sectors, 
magnetic fields, and improper system shutdown. 


Beyond the traditional threats, there are new 
threats to today’s systems. Thieves steal laptops, and 
the data contained on them, a threat which is much 
less applicable to traditional workstations and servers. 
While even the most skilled social engineer will have trouble
convincing naive users to purposely delete
data, even simplistic email viruses trick these same
users into running hostile code with depressing regularity.
Finally, the threat posed by modern worms
dwarfs that of older worms [16]; modern worms can
compromise every vulnerable machine on the Internet
faster than any manual response can prevent [17].


Organizations depend on their computer systems 
more than ever. Loss of data is therefore more expensive 
than ever in terms of lost work and downtime. Addition- 
ally, the public’s increasing awareness of the importance 
of data security means that data loss has a large negative 
publicity component. With increasing threats and 
increasing costs, backups are more crucial than ever. 


Lastly, we recognize that developing a backup strategy is
an individual process tailored to a specific network, its
data, and organizational objectives; different strategies
work for different purposes. A survey of factors to
consider, such as the one in [8], provides an excellent
planning tool for developing backup strategies.


Properties of Good Backups 


In a well-managed network, backup operations 
are performed on a regular basis. Additionally, a good
recovery system is essential. During both normal use 
and recovery, backup operations should be transparent 
to users. Backup operations should be automatic and 
not be the responsibility of users. Instead, a system 
administrator should centrally manage backup and 
recovery operations. Since backups are a high priority, 
they should be managed by a person who understands 
their importance, rather than a new hire or intern. 
Finally, the scale of modern networks is beyond what 
can be manually managed. Good management requires 
human intelligence supported by automated informa- 
tion gathering and management. 


Backup Nuances 


Networks are categorized in various ways. A 
static network consists of physically attached worksta- 
tions with a network structure that only administrators 
modify, and then only rarely. A dynamic network adds 
wireless machines and constantly changing physical 
structure. A homogeneous network consists of similar 
attached devices all running the same operating sys- 
tem. A heterogeneous network adds a mixture of vari- 
ous devices with different operating systems. Dynamic 
heterogeneous networks are a superset of static homo- 
geneous networks, and performing backup operations 
in these networks is more complicated and requires 
additional tools. This paper discusses backup opera- 
tions in the more general case, which applies to most 
modern networks. 


We identify four distinct factors that account for 
backups being more complicated in a dynamic hetero- 
geneous network. The proliferation of laptops exacer- 
bates these factors, as laptops multiply the dynamism 
of the network. 

• Networks consist of both physically connected
  and wireless computers. Both types need successful
  backups on a regular basis, yet each
  requires a different strategy. A physically
  attached workstation can be scheduled for
  backup when the network workload is light. A
  wireless computer requiring a backup must be
  scheduled while it is connected.

• Some computers may go days, weeks, or longer
  without logging onto the network. These
  machines cannot be backed up at a fixed scheduled
  time each night. To back up these computers
  effectively, a system must maintain information
  identifying the last time a computer
  logged on and the time of the last successful
  backup.

• Some users have multiple computers, such as
  several laptops and a workstation, which they
  periodically switch among. A person may use
  several machines in the same day and then use
  one machine exclusively for several months.
  All of the machines must be backed up.

• A machine can have multiple users who perform
  different types of processing. One user’s
  job function may require preferred treatment of
  the machine during backups. Although some
  users are aware of the importance of backups,
  most are not and want no role in the backup
  process. Finally, some users’ work habits are
  not conducive to good backup practices.
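
The second factor above calls for tracking, per computer, the last time it was seen on the network and the time of its last successful backup. A minimal Python sketch of such a record follows; the class name, field names, and methods are hypothetical illustrations and are not taken from the BTS implementation.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class HostRecord:
    """Hypothetical per-host tracking record (names are illustrative only)."""
    name: str
    last_seen: Optional[datetime] = None        # last time the host was on the network
    last_backup_ok: Optional[datetime] = None   # time of the last successful backup

    def saw_on_network(self, when: datetime) -> None:
        # Keep only the most recent sighting.
        if self.last_seen is None or when > self.last_seen:
            self.last_seen = when

    def backup_finished(self, when: datetime, success: bool) -> None:
        # A failed attempt must not advance the "last successful backup" time.
        if success:
            self.last_backup_ok = when

    def days_since_backup(self, now: datetime) -> Optional[int]:
        # None means the host has never been backed up successfully.
        if self.last_backup_ok is None:
            return None
        return (now - self.last_backup_ok).days
```

Keeping the two timestamps separate is what lets a tracker distinguish a host that is absent from one that is present but not being backed up.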


In a dynamic environment, the networked com- 
puters must initiate the backup operation, since the 
backup server does not know who is attached at a 
given time. Hence, software installed on each net- 
worked computer must coordinate data exchanges 
with the backup server. Whenever a new computer is 
added to the network, the backup client program 
should be part of the initial software load. Existing 
computers also need the client software installed. 


Client software installation requires knowledge 
of which computers are actually present on the net- 
work. There may be no central point of control to 
identify when a new computer is added to the net- 
work. Likewise, existing computers can be perma- 
nently removed from the network without informing 
any authority. When a machine without backup soft- 
ware that has not been seen for months suddenly reap- 
pears, the machine’s user needs to be contacted to 
install the software. Another computer not seen on the 
network for a comparable time may never reappear.
It would be a waste of time to contact its user.


Related Work 


This section highlights backup systems or applied 
research with relevance to BTS. A comprehensive sum- 
mary of all backup systems could not be included here, 
so we have selected a cross-section of the previous 
work. For a more comprehensive description of backup 
system issues and examples, see [8, 5]. 


Amanda (Advanced Maryland Automated Network
Disk Archiver) is an early example of freely
available backup management software [1, 14]. It uses
a combination of full and incremental backups to concurrently
back up networked clients to a single designated
backup server and uses configuration files to
determine the type of backup to perform. It has 
research significance in that it attempts to minimize 
cumulative overall backup per day in terms of number 
of backup runs, percentage change per backup run, 
and total amount of data [8]. Multiple commercial sys- 
tems [6, 9, 11] now provide Amanda-like functional- 
ity; however, none deal gracefully with wireless hosts. 


RAID [3] can protect systems against the failure 
of individual components. It provides no protection 
against unintentional/unauthorized modification of 
data, nor from catastrophic failure. Traditional RAID 
systems are impractical to field for mobile systems. 
However, a more recent RAID paper [15] applies 
RAID to a group of disconnected and distributed com- 
puters sharing storage remote from them. The aim is to 
provide a reliable RAID storage system that delivers 
acceptable performance while also providing a single 
coherent namespace for disconnected personal devices. 


[4] proposes a taxonomy for backups, including 
categories such as full versus incremental, file versus 
device, online or not-in-use, snapshot, and copy-on-write.
It then places well-known backup programs
including xdump, tar, IBM ADSM, Legato Networker, 
Amanda, Plan9, and Andrew into the above categories. 


Versioning file systems, such as Elephant [13], 
can protect against unintentional/unauthorized modifi- 
cation of data. However, a determined attacker can 
cause the history to be modified in undesirable ways. 
Even total versioning file systems like S4 [18] are no 
protection against physical failures. 


[10] evaluates four backup algorithm strategies: 
(1) incremental, (2) daily-full, (3) mixture of full- 
incremental, and (4) concurrent backups using backup 
streams. The paper compares the efficiencies of these 
algorithms for both backup and restore operations. 


In [7] a group of computers form a peer-to-peer 
network for backup operations. Data from one com- 
puter is distributed over other computers that have 
available capacity. The paper raises many non-stan- 
dard backup issues related to confidentiality, integrity, 
authentication, and various other security issues. This
approach is not currently a viable commercial solution,
but it is very interesting nonetheless and may have
future intranet applications.


The unique feature of BTS is its ability to priori- 
tize backup based on system and user priority. The 
closest related work in the spirit of BTS is [2] which 
examines dependability in infrastructure systems by 
placing priority on components based on their utility 
in terms of economics and operations research. BTS 
carries this utility concept forward specifically as an 
ongoing backup process controllable by the user. 


BTS System Description


To perform backup operations, a backup system 
must know which computers comprise the network 
and when each was last successfully backed up. 
Hence, in addition to the backup software and server, 
the BTS program monitors all networked computers 
and tracks backup status information, including the 
last successful backup date. BTS utilizes a database to 
manage this information on every computer that has 
connected to the NCSA network during the past sev- 
eral years. The relationship among these components 
is illustrated in Figure 1. 





[Diagram components: Web clients, Enterprise Management Server, Backup Server.]

Figure 1: Relationship between Clients, BTS, and
Backup Server.


BTS tracks whether users are logged-in via the 
NCSA authentication system and creates records of this 
activity at sixty-minute intervals. BTS also polls net- 
work machines to see if they are online. Then, for each 
online host, BTS uses algorithms based on system/user 
priority and time-since-backup to determine if an
immediate backup is needed. If a backup is performed, 
the files are downloaded to a Mass Storage backup sys- 
tem containing 6 TB of disk cache and an ADIC tape 
library with six tape drives. An extensible database 
containing the specific information used to determine 
backup priority is used to coordinate information shar- 
ing for the different BTS components as well as for ar- 
chiving and presentation to the user via a web interface. 
The data contained in the database includes the follow- 
ing: VIP users within the organizational hierarchy, 
problem machines, host-to-subnet mappings, and sub- 
net-to-geographic location mappings. 
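
The decision described above (for each online host, weigh system/user priority against time since the last backup) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name and the per-priority windows of one day and seven days are hypothetical, since the paper does not give the actual thresholds or algorithm.

```python
from datetime import datetime, timedelta
from typing import Optional

# Assumed acceptable windows per priority class; the real values are not given in the paper.
MAX_AGE = {"vip": timedelta(days=1), "non_vip": timedelta(days=7)}


def needs_backup_now(online: bool, is_vip: bool,
                     last_backup_ok: Optional[datetime], now: datetime) -> bool:
    """Return True if an immediate backup should be started for this host."""
    if not online:
        return False              # only hosts currently on the network can be backed up
    if last_backup_ok is None:
        return True               # never backed up: do it at the first opportunity
    limit = MAX_AGE["vip" if is_vip else "non_vip"]
    return now - last_backup_ok >= limit
```

With these assumed windows, a VIP host last backed up two days ago would be backed up immediately on contact, while a non-VIP host of the same age would wait.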


BTS performs functions beyond helping with 
backup operations [12]. It provides information on 
network computers via the computer user name, IP 
address, and NetBIOS name. The Tivoli name is the 
identifier used for the backup processing. If the Net- 
BIOS name and Tivoli node name are different, this 
will be indicated in the host list. BTS can chronologi- 
cally list all computers, from those currently attached 
to the network through those that have ever been 
attached, over the lifetime of the BTS program. BTS 
also provides information about backups, including 
the time last performed, successful or unsuccessful, 
amount of data backed up, etc. Netview updates the 
BTS database every ten minutes. Hence only rarely 
will a network user go undetected. The BTS database 
also identifies each computer as having at least one 
VIP user or only non-VIP users. Machines with VIP
users have preferred status during backups. In prac- 
tice, most computers have a single user categorized as 
a VIP or non-VIP. 


Standard Types of Backup Operations 


Historically, three basic strategies have been used
to perform backup operations, varying in the amount 
of data backed up and ease of restoration. A full 
backup backs up all data on a scheduled basis, and 
requires the most time and storage. However, it is the 
simplest to understand and the easiest from which to 
restore data. An incremental backup begins with a full 
backup. Subsequent backup operations copy only 
those files that have been modified since the last 
backup. Hence, every backup after the first includes a 
relatively small amount of data. Periodically, a new 
full backup is performed. Each of the partial backups 
is stored on a separate tape, so restoration involves 
processing all tapes from the current full copy up 
through the incremental backups to the most recent. 
Differential backup is similar to incremental, except 
that after the initial full backup, a single device is used 
for all of the incremental backups. 
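
The restore procedure for incremental backups described above (start from the current full copy and process every incremental up to the most recent) can be illustrated with a short Python sketch. The function name and the tape labels are hypothetical and serve only as an example.

```python
from typing import List, Tuple

# (label, kind), where kind is "full" or "incremental"; the list is in chronological order.
BackupSet = Tuple[str, str]


def tapes_for_restore(history: List[BackupSet]) -> List[str]:
    """Return the tapes to process, oldest first, to restore the latest state."""
    last_full = None
    for index, (_, kind) in enumerate(history):
        if kind == "full":
            last_full = index     # remember the most recent full backup
    if last_full is None:
        raise ValueError("no full backup in history")
    # The current full copy plus every incremental taken after it.
    return [label for label, _ in history[last_full:]]
```

The length of the returned list grows with the time since the last full backup, which is why a new full backup is performed periodically.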


Progressive Backup Operations 


None of the three basic strategies (full, incremental,
differential) is well suited to a dynamic network
environment, since network dynamics violate the timing
considerations the standard techniques require. For
example, a wireless user may only remain logged on
for occasional short periods. To alleviate these issues, a
fourth backup strategy is used: Progressive Backup.
Progressive backup initially copies all files on a com- 
puter and generates a summary report identifying 
when each file was last backed up and last modified. 
This report can also contain other file attributes such 
as size and creation date. The more information stored, 
the more efficient subsequent backup and recovery 
operations can be. 


Each time a user logs on, a decision is made as to
whether a backup operation is needed. Following the 
initial backup, subsequent progressive backup opera- 
tions compare current file information with the sum- 
mary report information. Based on this comparison, 
the backup only copies new and modified files. 
Unchanged files are not recopied. Determining the 
files that need to be backed up often requires more 
time than actually copying the data. Each backup 
operation also updates the summary report. Progres- 
sive backup can be extended to support versioning, 
where the most recent several versions and their sum- 
mary information are saved. 
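
The comparison step of a progressive backup can be sketched as follows. This is a simplified Python illustration, not the algorithm used by Tivoli or any other product: the summary report is modeled as a dictionary from path to (size, modification time), and all names are hypothetical.

```python
from typing import Dict, List, Tuple

FileInfo = Tuple[int, float]   # (size in bytes, modification time)


def files_to_copy(current: Dict[str, FileInfo],
                  summary: Dict[str, FileInfo]) -> List[str]:
    """New files are absent from the summary; modified files differ from it."""
    return sorted(path for path, info in current.items()
                  if summary.get(path) != info)


def progressive_backup(current: Dict[str, FileInfo],
                       summary: Dict[str, FileInfo]) -> List[str]:
    """Select the files to copy, then bring the summary report up to date."""
    todo = files_to_copy(current, summary)
    # (The actual copy to the backup server would happen here.)
    for path in todo:
        summary[path] = current[path]
    return todo
```

Running the same backup twice in a row copies nothing the second time, because the summary report already matches the files on disk.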


During the backup, data is stored in the summary 
report relational database. SQL queries can be used to 
retrieve information about the backed up data associated 
with a given computer. Using the information about the
backed up files in this database, restore operations can
be performed easily and correctly, something
that does not always occur with incremental and
differential backups. In some circumstances, these two
strategies can restore redundant and even incorrect data.


Depending on the storage media, files copied 
during the current backup may not be contiguously 
stored with existing backed up files from the same 
computer. However, the backup system’s relational
database identifies where each file from each computer
can be located on the storage media. In this way,
the database allows quick and easy restore operations 
to be performed. 
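
As an illustration of how a relational catalog supports such restores, the sketch below uses Python's built-in sqlite3 module. The table name, columns, volume labels, and sample rows are all hypothetical; commercial products keep their own internal schemas, which are not described in this paper.

```python
import sqlite3

# Hypothetical catalog: one row per backed-up copy of a file.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE backed_up_files ("
    " host TEXT, path TEXT, volume TEXT, backed_up TEXT)"  # ISO dates sort correctly as text
)
conn.executemany(
    "INSERT INTO backed_up_files VALUES (?, ?, ?, ?)",
    [
        ("JOEPC", "/docs/report.doc", "TAPE-03", "2004-03-01"),
        ("JOEPC", "/docs/report.doc", "TAPE-07", "2004-03-15"),
        ("JOEPC", "/docs/notes.txt", "TAPE-03", "2004-03-01"),
        ("KARENPC", "/docs/report.doc", "TAPE-05", "2004-03-05"),
    ],
)


def locate_latest(conn, host, path):
    """Return (volume, date) of the newest copy of one file, or None if absent."""
    return conn.execute(
        "SELECT volume, backed_up FROM backed_up_files"
        " WHERE host = ? AND path = ?"
        " ORDER BY backed_up DESC LIMIT 1",
        (host, path),
    ).fetchone()


def restore_plan(conn, host):
    """Newest copy of every file on a host: one row per path, no redundant versions."""
    return conn.execute(
        "SELECT f.path, f.volume, f.backed_up FROM backed_up_files f"
        " WHERE f.host = ? AND f.backed_up = ("
        "   SELECT MAX(g.backed_up) FROM backed_up_files g"
        "   WHERE g.host = f.host AND g.path = f.path)"
        " ORDER BY f.path",
        (host,),
    ).fetchall()
```

Because the plan names exactly one volume per file, a restore does not need to replay older copies, even when the copies are scattered across the storage media.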


In summary, progressive backups require less 
server time, minimize the required network band- 
width, utilize less storage media to hold backed up 
data, and are more accurate and efficient for restore 
operations than the other types of backups. When the 
host computer initially contacts the backup server, the 
server initiates a backup operation immediately for a 
laptop and schedules a later time for a workstation, 
typically after the workday ends. 

Figure 2: Venn Diagram categorizing backup status
of clients. 


Hierarchical Backup Strategy 


Not all computers attached to a network need to 
have the same backup strategy — some computers and 
information are more important than others [2]. The 
Venn diagram in Figure 2 illustrates the various cate- 
gories into which a computer can be placed. While 
performing progressive backups as described above, 
individual computers are prioritized, and time slots are 
assigned to each priority. Computers used by a VIP 
are considered more important and given more atten- 
tion during backup operations than non-VIP comput- 
ers. Part of this extra attention is currently provided 
manually, although the BTS program provides some 
help. The more important the VIP, the more important 
it is that timely backups are successfully performed on 
their machine(s). In the process of getting all existing 
users to install the client software on their computer, 
VIPs have been contacted first, in order of their impor- 
tance. Computers used by the highest-level VIP are 
considered the most important computers. Computers 
used by VIPs reporting directly to this person are at 
the next highest level, etc. At the other extreme, the 
users classified as least important will be contacted
last. The system administrator responsible for backup 
operations generates this user importance ranking and 
implements a strategy for contacting the VIPs. 


How a “VIP” is defined depends on what makes
the most sense for a particular
organization. For example, the most important
VIPs may be one or more key software developers 
rather than the organization’s president. 


A backup failure on a VIP’s computer is given 
priority. Such a failure is investigated, and the reason
for the failure is identified in an entry in the BTS report
of all VIPs whose systems have not been backed up in the
last 10 days. The BTS reports shown in Figures 4, 5,
and 6 illustrate that it is possible to determine which 
VIP machines have been successfully backed up. 


Reports Produced By the Backup Tracking System 


When the BTS program starts, it displays a Menu 
screen that allows several types of reports to be gener- 
ated. Figure 3 shows the Menu screen prior to entering 
any data. Default values are provided for every data 
entry field including the three text fields. 


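[Figure 3, left and center of the Menu screen: Backup Tracking System
banner showing 4:27:47 AM, 3/27/2004; links labeled stats, help, and
about; and a Text Search panel with options labeled ip address, ntuser,
all, and netbios name, plus a Find Host(s) button.]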


The central portion of the Menu (labeled “Text
Search”) is used to generate a listing of all computers
on the network whose name contains the value entered
in one of three fields: Find Host(s) by NetBIOS name,
Find Host(s) by NT user name, or Find Host(s) by IP address.
A partial wildcard may be entered in any of the three fields. With the
IP address, a partial wildcard value is treated as the
beginning of the address. If nothing is entered, a listing
of all computers in the network is generated. Figure 4
shows the results of entering PC in the user name field.
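
The matching rules described above (substring match for names, prefix match for IP addresses, and everything when the field is empty) can be sketched in Python as follows. The function, the dictionary keys, and the sample hosts are hypothetical, the addresses come from the reserved documentation ranges, and the case-insensitive comparison is an assumption rather than documented BTS behavior.

```python
from typing import Dict, List

# Sample data for illustration only.
HOSTS = [
    {"netbios": "JIMMYPC", "ntuser": "jimmy", "ip": "192.0.2.10"},
    {"netbios": "JOHNPC", "ntuser": "john", "ip": "192.0.2.11"},
    {"netbios": "WEB-SERVER", "ntuser": "admin", "ip": "198.51.100.7"},
]


def find_hosts(hosts: List[Dict[str, str]], field: str, value: str) -> List[Dict[str, str]]:
    """field is one of "netbios", "ntuser", or "ip"."""
    if not value:
        return list(hosts)                 # nothing entered: list every computer
    if field == "ip":
        # A partial IP value is treated as the beginning of the address.
        return [h for h in hosts if h["ip"].startswith(value)]
    needle = value.lower()                 # assumed: name matching ignores case
    return [h for h in hosts if needle in h[field].lower()]
```

For example, `find_hosts(HOSTS, "netbios", "PC")` selects the two hosts whose NetBIOS names contain PC, while `find_hosts(HOSTS, "ip", "198.51")` selects only the server.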


BTS generates reports that identify which
machines have had a successful backup performed
within the last 10 days, within the last 30 days, or ever.
These reports can cover only
machines belonging to VIPs, only non-VIPs, or both
groups. Three reports can be displayed that identify all
computers successfully or unsuccessfully backed up
within the last 10 days, within the last 30 days, and since
BTS started running.


Radio buttons on the right-hand side of the menu
screen allow a user to make selections in four categories
to identify which computers will be included in the


[Figure 3, right of the Menu screen: Criteria Search panel with y/n
radio buttons for "on network" and "successful backup" under the
columns 10 days, 30 days, and ever; y/n/all radio buttons for
"on tivoli" and "VIP host"; and a List Host(s) button.]




Figure 3: Menu screen showing default settings. 


     HOST                                     BACKUP
     NetBios Name           location      Last in NCSA   location   date
     (= Tivoli node name)   building(s)   Domain         building
 1.  JIMMYPC                ACB / CAB     3/27/2004      ACB        3/6/2004
 2.  JOHNPC                 DCB / CAB     3/27/2004      CAB        3/8/2004
 3.  JOEPC                  CAB           3/27/2004      CAB        3/15/2004
 4.  KARENPC                CAB           3/27/2004      CAB        3/5/2004
 5.  ROBPC                  CAB / CAB     3/19/2004      CAB        3/12/2004
 6.  GREGPC                 DCB           3/27/2004      DCB        3/10/2004
 7.  MONITORPC              SRP / SRP     3/26/2004      SRP        3/1/2004


Figure 4: Listing of all computers whose user name contains “wildcardPC.”


     HOST                                         BACKUP
     NetBios Name           location            Last in NCSA   location   date
     (= Tivoli node name)   building(s)         Domain         building
 1.  JIM-LAPTOP             SRP                 6/2/2004       SRP        7/5/2004
 2.  SUSAN-LAPTOP           SRP / ACCESS        6/2/2004       ACCESS     6/25/2004
 3.  JAMES-DESKTOP          CAB / BI            7/26/2004      BI         7/19/2004
 4.  STUDENTI               BI                  8/2/2004       BI         7/21/2004
 5.  RESEARCH-SERVER        CAB                 8/2/2004       CAB        7/23/2004
 6.  WEB-SERVER             Offsite / Offsite   8/2/2004       SRP        7/21/2004
 7.  BACKUP-SERVER          SRP / ACB           6/2/2004       SRP        6/21/2004
 8.  TEST-SERVER            SRP / ACB           6/2/2004       SRP        6/21/2004


Figure 5: Listing of all computers meeting the specified criteria.
report. One of three choices is selected from VIP Host,
Non-VIP Host, or Both Categories. One of two choices is
selected from Contains Client Software (Tivoli) or does
not contain Client Software. One of two choices is
selected from Successfully or Unsuccessfully backed up.
One of two choices is selected from On or Off the network.
For the previous two choices, an additional selection
is made from one of three time intervals: last 10
days, last 30 days, or never. Figure 5 shows the results of
selecting VIP Host, Client Software Installed, and Unsuccessful
backup during the last 10 days.
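
A criteria search of this kind amounts to filtering the host list on several independent conditions. The Python sketch below shows three of the four categories (the on/off network criterion is omitted for brevity); the class, function, and parameter names are hypothetical, and passing None for a criterion stands for the "all" or "both" choice.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional


@dataclass
class Host:
    name: str
    vip: bool
    has_client: bool                   # backup client software installed
    last_backup_ok: Optional[date]     # None if never backed up successfully


def list_hosts(hosts: List[Host], today: date,
               vip: Optional[bool] = None,
               has_client: Optional[bool] = None,
               backed_up: Optional[bool] = None,
               window_days: Optional[int] = None) -> List[str]:
    """Return names of hosts matching every criterion that is not None."""

    def ok_within(h: Host) -> bool:
        if h.last_backup_ok is None:
            return False
        if window_days is None:        # no window: "ever" backed up
            return True
        return today - h.last_backup_ok <= timedelta(days=window_days)

    selected = []
    for h in hosts:
        if vip is not None and h.vip != vip:
            continue
        if has_client is not None and h.has_client != has_client:
            continue
        if backed_up is not None and ok_within(h) != backed_up:
            continue
        selected.append(h.name)
    return selected
```

A query such as `list_hosts(hosts, today, vip=True, has_client=True, backed_up=False, window_days=10)` corresponds to the selection used for Figure 5.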


Any of the computers in the Figure 5 listing can 
be selected to have BTS generate a report providing 
additional general information and the status of recent 
backup operations for that machine. Figure 6 shows a 
report of the most recent backup operation results for 
computer NCSA-SERVER. 

Unsuccessful Backup Operations 

There are several reasons why some machines 
are not successfully backed up. Most commonly, the 
computer does not have a backup client installed on it. 
BTS is used to identify these machines. To resolve this 
problem with existing networked computers, it is necessary
to contact the user of the machine and install
the software, a time-consuming process.


Another possible reason for failure is that the
system has a backup client and is part of the network,
but is unavailable for backup. BTS also identifies these
machines. Many users have multiple machines but use
only one of them at a time. The machines that are
not in use are not backed up because they cannot
be accessed. If a workstation is powered off at the
end of the day, it cannot be backed up that night.


Failure can also occur when the backup client 
software is installed on the client, but is incorrectly 
configured on the server. The server must be scheduled
correctly in order to back up the client computer.


Finally, a few backups may inexplicably fail and 
require a restart of the backup server. 


Vendor Product Issues 


There are several commercial products that can 
do the type of processing described in the Progressive 





NetBios Name: NCSA-SERVER
Last seen in Network Neighborhood: 3/27/2004 1:00:04 PM

Authenticated Windows user(s):
    sumike           3/19/2004
    sujim            3/17/2004
    [illegible]      3/5/2004
    suamy            2/19/2004
    Administrator    2/19/2004
    surobert         1/9/2004

Recent IP Number(s):
    124.126.28.39    ACB    High-end systems
    124.126.28.50    ACB    Production Servers

Tivoli Activity Log:
                                          OBJECTS
    date/time                 bytes       inspected   [illegible]   failed        transfer      total
    [illegible] 3:45:27 AM    138GB       1385363     1193          45            4min          02:30:14
    [illegible] 3:43:24 AM    613.93MB    1384646     1049          46            22min         02:28:02
    [illegible] 11:07:56 AM   940.81MB    1383900     585           22            38min         01:21:22
    [illegible] 3:47:38 AM    102GB       1383733     494           22            3min          02:32:08
    [illegible] 6:13:40 AM    1281GB      1363574     3943          22            90min         03:58:09
    [illegible] 3:35:67 AM    88.968MB    1381592     83            [illegible]   [illegible]   02:20:49
    9/24/2003 4:12:01 PM      Tivoli installed


Figure 6: Report on backup information for computer NCSA-SERVER.
Backup Operations section, including the Tivoli Enterprise
Management Suite [9] from IBM, Retrospect [6]
from Dantz Development Corp., and Networker [11]
from Legato Systems.


However, some backup products are designed 
specifically for small networks and do not scale to a 
network the size of NCSA. At NCSA, Tivoli is used to 
perform the actual progressive backup operations. 
Tivoli performs all of the relevant backup processing 
and does not significantly interfere with regular net- 
work traffic. Several other products not mentioned 
were initially tried but they negatively impacted net- 
work traffic. In addition, with some earlier products 
the server initiated the backup rather than the client, a 
bad idea in a dynamic networking environment. 


The Dantz Retrospect Professional product is 
designed for home and small offices using Windows 
and Apple Macintosh computers [6]. It performs progressive
backups with 100% correct restores and no redundancy
(a significant feature). It uses compression and
can back up data to any media. However, this solution
does not scale to networks the size of NCSA.


Other products can simultaneously back up
dozens of clients. When the client contacts the server
to determine whether a backup is needed, the server
makes a decision based on the identity of the client
computer. Criteria can include the following: the client
is a wireless machine with a VIP user (back up immediately);
the client is a server or a workstation belonging to a
VIP (back up every 24 hours); the client is a user
workstation (back up every weekday, but not over the
weekend); and the client is a VIP’s computer not seen on
the network for weeks (back up immediately).
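

The identity-based criteria above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the logic of Tivoli or BTS: the `Client` fields, the category labels, and the two-week reading of "not seen for weeks" are all introduced here for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

ABSENCE_THRESHOLD = timedelta(weeks=2)   # assumed meaning of "not seen for weeks"
DAILY = timedelta(hours=24)


@dataclass
class Client:
    kind: str              # "server", "workstation", or "wireless" (hypothetical labels)
    vip: bool              # True if the machine belongs to a VIP user
    last_backup: datetime  # completion time of the most recent backup
    last_seen: datetime    # last contact before the current one


def backup_needed(client: Client, now: datetime) -> bool:
    """Decide, when a client contacts the server, whether to back up now."""
    if client.vip and client.kind == "wireless":
        return True                                  # wireless VIP: immediately
    if client.vip and now - client.last_seen >= ABSENCE_THRESHOLD:
        return True                                  # long-absent VIP machine: immediately
    if client.kind == "server" or client.vip:
        return now - client.last_backup >= DAILY     # every 24 hours
    # Ordinary user workstation: once per weekday, never on weekends.
    return now.weekday() < 5 and client.last_backup.date() < now.date()
```

Ordering the checks from most to least urgent keeps each rule independent of the ones below it.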


Another significant backup issue is how to 
process the files that comprise well-known applica- 
tions that are running on most computers. Examples 
include Microsoft Word, Excel, Access and even the 
operating system itself. It should not be necessary to 
copy these files from almost every machine. The 
backup software can be provided with a list of files to 
exclude during backups. Alternatively, application 
software and the operating system can be reinstalled 
rather than restoring from a backup. 
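

An exclusion list of this kind can be illustrated with a short Python sketch. The patterns below are hypothetical examples, not the list used at NCSA, and the matching is a simple case-insensitive wildcard comparison rather than the behavior of any particular backup product.

```python
from fnmatch import fnmatchcase

# Hypothetical exclusion patterns; a real list would be site-specific.
EXCLUDE_PATTERNS = [
    "c:/winnt/*",
    "c:/program files/microsoft office/*",
    "*.tmp",
]


def files_to_back_up(paths, patterns=EXCLUDE_PATTERNS):
    """Return the paths that match none of the exclusion patterns."""
    kept = []
    for path in paths:
        # Normalize separators and case so one pattern covers all spellings.
        normalized = path.replace("\\", "/").lower()
        if not any(fnmatchcase(normalized, pattern) for pattern in patterns):
            kept.append(path)
    return kept
```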


BTS consists of an ASP application written in
VBScript running with Microsoft IIS 5.0 on a Win-
dows 2000 server. BTS uses an Access database con-
taining information about the networked computers.
The database is distinct from the relational database 
used by commercial backup software such as Tivoli. 


Availability 


Various statistics have been collected from the 
NCSA network, but the most interesting and valuable 
have been measurements of availability in terms of 
systems and users. 


Let s represent the number of systems on the net-
work at a given point in time, measured every ten min-
utes. The mean number of systems and the normalized
system variation for a given time period of n measure-
ments can then be respectively defined as:


$$\bar{s} = \frac{1}{n}\,(s_1 + s_2 + s_3 + \cdots + s_n)$$


$$N_s = \frac{\sigma_s}{\bar{s}}$$


where $\sigma_s$ is the standard deviation of the measurements.


In an environment which is extremely static, where
all systems stay on the network and each user
logs in every work day, $N_s = 0$ and $N_u = 0$. As the
number of systems and users per day varies, the values
of $N_s$ and $N_u$ increase. For example, if the number of
users varies by an average of ±10%, then $N_u = 0.10$.
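

As a worked illustration, the sketch below reads the normalized variation as the standard deviation of the samples divided by their mean. That reading is an assumption about the intended definition, and the function name is introduced here for the example.

```python
from statistics import mean, pstdev


def normalized_variation(samples):
    """Population standard deviation of the samples divided by their mean."""
    average = mean(samples)
    if average == 0:
        raise ValueError("mean is zero; the normalized variation is undefined")
    return pstdev(samples) / average
```

Under this reading, a count that alternates between 90 and 110 (a swing of ±10% around a mean of 100) gives a normalized variation of 0.10, and a perfectly static count gives 0.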





Figure 7: Hypothetical system availability. 


Similarly, let u represent the number of users
who authenticate in a given 24-hour period, recorded
once per day at midnight. The normalized user variation
$N_u$ is defined from u in the same way as for systems.





Figure 8: Hypothetical user availability. 


It should be expected in environments which
have high values of $N_s$ and $N_u$ that it would be consid-
erably more difficult to back up, apply security
patches, and track down systems than in an environ-
ment with lower values of $N_s$ and $N_u$. High values of
$N_s$ and $N_u$ would therefore imply a higher security risk
for a given infrastructure effort, or to put it another
way, a higher support cost for a given level of security
and survivability. This ability to track usage trends has
proved useful for capacity provisioning, security
events, and equipment reliability failures.


Figure 9 is a sample of real measurements of sys- 
tem and user availability (respectively) on the NCSA 
network. Over a three-year period of measurement, the 
average number of systems available is 300 with 600 
distinct systems, an upper limit of 400, and a lower 
limit of 100. The normalized system variation is 0.10. 
Over the same three-year period of measurement, the 
average number of users is about 190 users with 220 
as the upper limit and 30 as the lower limit. The nor-
malized user variation is 0.53.


Conclusions


Sharing the general class of backup problems we
face at NCSA and our specific implementation solu-
tions has proved to be valuable to peer organizations.
We feel that the solutions we describe in this work are 
transferable to other environments even though this 
work was specifically targeted to the Windows envi- 
ronment. In fact, we already have a parallel project in 
progress transferring these same techniques to the 
Linux environment. 


Future directions include examining the possibility
of moving some of the functionality of Tivoli onto the
client, and having the client perform tasks (such as deter-
mining the files to back up) via intelligent algorithms.


Lastly, more information about BTS, implemen-
tation instructions, and the software itself are available
at this web page: http://wegpublic.ncsa.uiuc.edu/bts


Making a Game of Network Security


Marc Dougherty — Northeastern University 


ABSTRACT 


This paper describes our experiences in the design and implementation of a network for 
security training exercises, and one such exercise. The network described herein is flexible, and 
could be used for a wide variety of training exercises. Organizations of all sizes can use this type 
of exercise to train new personnel or keep existing staff at their best. In addition to the training 
benefits, these exercises can also be highly entertaining. The designs described here are intended 
to assist others in the construction of similar networks, and additional training scenarios. 


Introduction 


Companies large and small invest a great deal of 
money in the training of security personnel and system 
administrators. However, there is little formal training 
available for individuals without a company willing to 
pick up the check. This leaves the majority of sysad- 
mins to learn the old fashioned way: by making mis- 
takes. This is especially true of aspiring sysadmins at 
the college level, where system administration and 
security are discussed as a side note, if at all. In light 
of this, we set out to create a safe environment where 
all types of sysadmins could practice their skills, with- 
out putting personal or business data at risk. 


Inspired by the “Capture the Flag” competition 
held annually at Defcon [1], we sought to create a sim- 
ilar environment, with particular emphasis on security 
and teamwork. Security has become a higher priority 
for many companies, security-related spending has 
increased drastically, and as a result, many sysadmins 
have been “promoted” to specialized security roles, 
without any additional training. Teamwork is also a 
common weakness for sysadmins. Many sysadmins 
work individually, and few have much experience 
working with others. Team training exercises like 
those proposed here can help overcome these common 
weaknesses and be entertaining at the same time. 


Because of the sysadmin predilection for experi- 
ential learning, informal training exercises like those 
described here can be enormously beneficial, both for 
the individual sysadmin and his or her employer. 


The environment described here is the result of 
hard work from a group of students at Northeastern 
University, working jointly with the Systems Group at 
the College of Computer and Information Science. 


The network we have created involves two 
smaller networks, connected by an OpenBSD machine 
which performs the majority of the routing, and pro- 
vides limited access to the Internet from both net- 
works via HTTP, HTTPS and SSH. Internet access is 
vital as it allows access to the most current patches, 
tools, and technologies for both defense and offense. 


The first internal network, called the defender net-
work, is populated with the machines that contestants
administer. We provide contestants with a machine on
this network, which they must defend. Contestant 
teams may choose their preferred OS from a set of 
alternatives provided by the contest admins. Members 
of the administrative team often sabotage default secu- 
rity features of these installations, making the defender’s 
job more interesting. 


The second network is a new addition to the con- 
test, inspired by many requests of participants to bring 
along their own machines for use in the contest. The 
attacker network was created to allow participants to 
plug into the contest, while minimizing their risk of 
being attacked. A strict set of ACLs prevents incoming
connections to machines on this network and prevents
hosts on this network from communicating with each
other.


The keystone of the network is the routing policy 
on the OpenBSD gateway machine. This routing pol- 
icy is written entirely in pf, the OpenBSD packet filter. 
No incoming connections are allowed to the attacker 
network, but responses to existing connections are 
allowed. All traffic from the attacker network to the 
defender network is passed through unmodified, 
because much of that traffic will be crafted packets 
and exploits which should not be tampered with. Other 
duties of this gateway machine include serving DHCP 
to the attacker network and providing DNS, NAT, and 
Internet access for both internal networks. This 
machine is also responsible for the Nessus server 
which is used by the scoring system. 


Since this scenario was designed to be as realistic 
as possible, there is a minimal rule set. The only form 
of attack that is disallowed outright is denial of service 
attacks. However, the rules allow the contest admins 
to deem the conduct of a team “unsportsmanlike,”
and punish them accordingly. Punishments can range 
from a score penalty to expulsion from the contest. 


The total score of each team is the sum of offen- 
sive and defensive scores. The offensive score is very 
difficult to automate, and has typically been handled 
manually. When a participant compromises a service, 
they are required to leave some type of proof, which the
contest admins will then verify. Defensive scoring is
based on the availability of the team’s services. Periodi- 
cally, the scoring system tests the functionality of each 
service on the contestant’s host using several custom 
Nessus plugins written for this purpose. The state of 
these services is then recorded in a database. From the 
database, the information can take on any number of 
forms for presentation. Previous methods have included 
a text file, a web page, and an XML document pro- 
cessed by a Macromedia Flash application which dis- 
plays the information in a more exciting way. 


This contest has evolved to its current condition 
over the last several years, and continues to do so. 
Work is being done to move all routing policy into a 
Cisco Catalyst switch, which frees up the OpenBSD 
gateway machine to act as a web and ftp proxy for 
outbound connections. Also under development is a 
fully automated scoring system which merges the 
offensive and defensive aspects of scoring. The new 
system uses flag files to indicate when a service has 
been compromised. This scoring system requires sig- 
nificantly less manual intervention, and simplifies the 
lives of the contest administrators. 


Because we have been unable to find a suitable 
metric by which to measure system administration 
skill, we cannot prove that the participants in our con- 
test have become better admins. However, many par- 
ticipants have reported learning a great deal from the 
contest. In addition, many former participants have 
become members of the contest’s administrative team, 
which indicates that the contest inspires participants to 
strengthen their teamwork and security skills. 


Network Layout 


Designing a network for use in security training 
involves finding a delicate balance in communication 
policy, to assure that participants are given enough 
freedom to use various attack techniques, without 
granting them the ability to attack machines outside of 
the scope of the training environment. The environ- 
ment we have created consists of a defender network 
segment, and an attacker network segment. These net- 
works are connected to each other, and to the Internet, 
by a gateway machine that performs routing and 
packet filtering. 


Access to the Internet is provided to allow partic- 
ipants to research protocols and programs with which 
they are not familiar. The defender network is popu- 
lated with machines created by the contest administra- 
tors. The attacker network must be able to attack the 
machines on the defender network, while remaining 
protected from the attacks of others. Both networks 
must be prevented from launching attacks on other 
machines outside of the training environment. The 
establishment of proper routing policies is critical to 
the success of this network. If the routing policies are 
not restrictive enough, this network could be used to 
launch attacks to the outside. On the other hand, if the
policies are too restrictive, they may interfere with the
intended use of the network by blocking some types of 
anomalous traffic. 


Internet Access 


The simplest of the network policies involved is 
the routing for the Internet connection. Both the 
attacker and defender networks are permitted access to 
the Internet via HTTP(S) and SSH. All other traffic 
bound for the Internet is dropped. In the past, we have 
allowed teams to participate remotely by forwarding 
external ports on the gateway to hosts on the defender 
network. Each machine on the defender network is 
accessible by Remote Desktop and Secure Shell. In 
the most recent incarnation of the contest, we encour- 
aged participants to physically attend, eliminating the 
need to provide remote access. 

Defender Network 


The defender network is home to the machines 
that the contestants must defend. In order to give partic- 
ipants the opportunity to experiment with sniffing tools, 
the defender network has typically been hubbed rather 
than switched. This network uses private, non-routable 
IP addresses in the 10.10.0.0/24 range [2], which are 
provided by the gateway. To decrease the potential for 
abuse of this network, Internet access is limited to 
HTTP(S) and SSH. To minimize the vulnerability of 
machines on the attacker network, defender machines
may not initiate connections to the attacker network. 


Attacker Network 


The attacker network was a new addition in the 
most recent incarnation of the network and its design 
posed an interesting challenge. The goal of this net- 
work was to allow participants to bring along their 
own machines for use in the attacking portion of the
exercise, while minimizing the risk associated with 
plugging into such a hostile network. 


The attacker network uses private non-routable IP 
addresses in the 10.20.0.0/24 range, which are pro- 
vided by the DHCP server on the gateway. Packets 
from this network destined for machines on the 
defender network are passed through with no modifica- 
tion since many of these packets are likely to be 
exploits or other types of malicious traffic. The attacker 
network is shielded from incoming connections from 
the defender network by the defender network’s filter- 
ing, but attackers are still susceptible to attacks from 
fellow attackers. The ideal way to isolate attackers 
from each other is to create a separate network for each 
attacker. Since we lacked the hardware resources to do 
so, the solution we have implemented relies on ACLs 
to control the network traffic of attackers. 


Using a Cisco Catalyst 3550, attacker communi- 
cations can be limited with VLAN-layer ACLs. The 
attacker network ACLs prohibit machines on the 
attacker network from communicating with any other 
host on the attacker network, except for the gateway
machine’s attacker network interface. This approach is
significantly simpler than segregating each attacker on 
their own VLAN, and is more scalable. Even with 
these countermeasures in place, all participants are 
forewarned that they are plugging into a hostile net- 
work at their own risk. 
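

The communication policy described in this section can be summarized as a small decision function. The Python below is a sketch of the policy's intent only: it is not the pf rule set or the switch ACLs used in the contest, the gateway address 10.20.0.1 is a hypothetical value, and it models only which host may open a new connection to which destination.

```python
from ipaddress import ip_address, ip_network

DEFENDER_NET = ip_network("10.10.0.0/24")
ATTACKER_NET = ip_network("10.20.0.0/24")
GATEWAY_ATTACKER_IF = ip_address("10.20.0.1")   # hypothetical gateway address
INTERNET_PORTS = {22, 80, 443}                   # SSH, HTTP, HTTPS


def may_initiate(src: str, dst: str, dst_port: int) -> bool:
    """Return True if host src may open a new connection to dst:dst_port."""
    s, d = ip_address(src), ip_address(dst)
    if s not in DEFENDER_NET and s not in ATTACKER_NET:
        return False                   # nothing from outside may connect in
    if d in ATTACKER_NET:
        # Attackers may reach only the gateway; defenders may not reach attackers.
        return s in ATTACKER_NET and d == GATEWAY_ATTACKER_IF
    if d in DEFENDER_NET:
        return True                    # attack traffic is passed unmodified
    return dst_port in INTERNET_PORTS  # Internet: HTTP(S) and SSH only
```

Writing the policy as one function makes it easy to check each rule against the prose before translating it into a real packet filter.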






[Diagram labels: OpenBSD Gateway; Defender Net]


Figure 1: All routing, policy (and proxy) functions.





Gateway and Routing 


Because the routing and communication policy 
between networks is so important to the success of the 
network, the policy was implemented using the 
method with which the administrators of the network 
were most familiar. Currently, the routing policy is 
written using OpenBSD’s packet filter, pf [3], but 
there is work in progress to port this configuration to 
Cisco IOS. 


The packet filtering rules were created using a 
set of class-based queues designed such that adminis- 
trative tools like Secure Shell and Remote Desktop are 
allotted 20% of the available bandwidth, and given a 
higher priority than other traffic on the network. This 
was done to minimize the negative effects of network 
saturation on any segment of the network. 


We also learned — the hard way — that the default 
size of the OpenBSD state table is far too small to run 
a contest like this. Shortly after the contest began, the 
state table was filled to capacity and stalled network 
communications. We have since increased the state ta- 
ble to store 20,000 states, and optimized the state 
timeouts to be more aggressive. 
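

Why a full state table stalls traffic, and why shorter timeouts help, can be seen in a toy model. The Python below is not how pf is implemented; the class, its parameters, and the use of a single idle timeout are simplifying assumptions made for illustration.

```python
class StateTable:
    """Toy connection-state table with a hard size limit and an idle timeout."""

    def __init__(self, limit: int, timeout: float):
        self.limit = limit
        self.timeout = timeout
        self.last_seen = {}          # connection id -> time of last packet

    def _expire(self, now: float) -> None:
        # Drop every state that has been idle for at least the timeout.
        stale = [c for c, t in self.last_seen.items() if now - t >= self.timeout]
        for conn in stale:
            del self.last_seen[conn]

    def track(self, conn, now: float) -> bool:
        """Record a packet for conn; return False if it must be dropped."""
        self._expire(now)
        if conn not in self.last_seen and len(self.last_seen) >= self.limit:
            return False             # table full: new connections stall
        self.last_seen[conn] = now
        return True
```

With a long timeout, idle entries left by port scans keep the table full and new connections are refused. Raising the limit and shortening the timeout both free space sooner.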


The hardware used for the gateway was a 500 
MHz Pentium II with 256 MB of RAM and three net- 
work cards. Although this may not seem like a lot of 
computing power, the machine performed well under 
the strain of the contest. The primary duties of this
machine are routing between networks and perform-
ing NAT for both the attacker and defender networks.
In addition, this machine also provides DNS resolution
for the attacker and defender networks using DNSMasq
[4] and provides DHCP leases to machines on the 
attacker network. In the future, this machine may also 
serve as a web and ftp proxy for the attacker and 
defender network. 


This network began as a simple test network and 
has evolved into a secure environment for network 
security testing. The potentially harmful traffic is iso- 
lated to this network, while still allowing participants 
access to the Internet. This network provides a safe 
environment in which sysadmins of all skill levels 
may experiment without endangering any important 
data. This environment can be used to stage a wide 
variety of testing scenarios, training sessions, and 
security challenges. 


The Game 


One possible use for the above network is a 
game that we have created, based primarily on an 
annual Defcon tradition now known as Root-Fu [5, 6], 
formerly known as Capture the Flag. Each team of 
participants is assigned a host on the defender net- 
work, which they must secure against the attacks of 
the other teams. Teams are awarded points for each 
successful test of their services, and each successful 
compromise of an opponent’s service. Teams must 
balance offensive and defensive tactics to gain as 
many points as they can over the duration of the con- 
test. However, there are some restrictions on the tac- 
tics that may be used in this contest. 


Rule Guidelines 


Laying down rules for a security challenge is 
more difficult than it seems, and enforcing rules is 
even more difficult. Strict rules may prevent partici- 
pants from making use of some useful attack strate- 
gies. Rules that are too lenient could result in teams 
using unfair methods to give themselves an advantage. 
The current rules do not allow for ARP spoofing or 
Denial of Service attacks, because these techniques 
have been used to disrupt the contest in previous 
years. The rules allow the contest administrators the 
ultimate authority in determining if a team’s behavior 
constitutes “unsportsmanlike conduct,” and penaliz-
ing them accordingly, either with scoring penalties or
expulsion from the contest. To spot potential viola- 
tions, contest administrators keep a close eye on traffic 
in the defender network. Contestants are also provided 
with a private, direct means of communication with 
the contest admins, in the event that they suspect their 
opponents of violating the rules. Fortunately, rule vio- 
lations have been rare since most contestants under- 
stand the spirit in which the game is run. 


The newest addition to the contest rules is the 
concept of machine ownership. If a team has gained 
administrative access on another machine on the 
defender network, the team may request ownership of 
the machine. This is done by filling out a “Change of
Ownership” report in the contest web tools, discussed
later. The contest administrators then update the own- 
ership of that machine in the database, and the team 
begins to earn defensive points for any services 
offered by that machine. 


Getting Started 


The first step in getting a competition like this 
running is recruiting participants. Each year, several 
members of the administrative team are tasked with 
creating advertising for the contest. Past advertising 
campaigns have consisted of posters and word of 
mouth, but more recently we have invested in a free- 
standing sawhorse sign featuring a skull and cross- 
bones. Once participants are interested, they will need 
to register. 


Team registration is done using a web-based tool 
that stores registration information in a database, 
which is used later for scoring. The registration form 
includes all the basics like name and contact informa- 
tion, in addition to an Operating System preference, 
which allows teams to choose a preference from a list 
of available OS choices. This list includes at least one 
Unix variant, and one Windows variant. The registra- 
tion system should be made available no less than one 
month before the scheduled beginning of the contest. 
This ensures that interested individuals have enough
time to find other interested parties to join with in the 
creation of a team. In order to give the administrators 
enough time to complete the initial setup, registration 
should close approximately one week before the con- 
test begins. Once the number of registered teams is 
known, the real work can begin. 


Initial Contest Setup 


The most time-consuming portion of running this 
contest is the building of machines destined for the 
defender network. In addition to the drudgery of OS 
installation, members of the administrative team often 
sabotage the default security of the OS by adding 
administrative users with easily guessed passwords, or 
disabling common security features. In order to grant 
the teams access to their machines, the administrators of 
the contest must maintain a list of the passwords set for 
the Administrator/root account, so that the passwords 
can be distributed to the contestant team. Since this is 
the least interesting part of running the contest, there are 
efforts under way to automate this part of the procedure. 


In early contests, the participants focused primar- 
ily on defense, which made for a relatively unexciting 
contest. In order to avoid this, a new class of machines 
also inhabits the Defender network for this competi- 
tion. A variety of Non-Player Computers, or NPCs, 
were set up, and deliberately left vulnerable, to give 
participants something at which to point their attack- 
ing skills. These NPCs are built from spare hardware 
found in storage rooms, and frequently include 
devices like printers, or similar networked devices, 
just to keep the competition interesting. Often, the task
of setting up NPCs is delegated to a newer student,
without much experience in system administration. 
The student can then observe his NPC to see how well 
the configuration fares against the attacking skills of 
the contestants and learn from any mistakes. In this 
way, even students involved in the administration of 
the contest can gain experience. 


Since the use of an Attacker network requires 
that the contestants be in physical proximity to the rest 
of the contest network, the contest now needs some 
physical space in which it can be held. This also 
requires that the contest be mobile enough to relocate 
for the duration of the contest. Fortunately, the CtF 
administrative team was able to obtain a spare 19” 
rack, in which they mounted a keyboard, mouse, mon- 
itor, and 16-port KVM switch. The bottom half of the 
rack was used to house the contestant machines, and 
the OpenBSD gateway machine described earlier, 
which also handles some of the duties of scoring. 


Contest Timetable 


On the first day of the competition, a fair amount 
of time must be devoted to administrative details and 
the rest is reserved for teams to work on securing the 
machine they have been assigned. The first adminis-
trative task is to distribute information packets to each
team. The administrators must verify the identity of 
the team leaders and the team leaders are then 
entrusted with the information packet. This informa- 
tion packet contains an official copy of the contest 
rules and the password for the administrative account 
on their assigned machine on the defender network. 
Also in the packet of information are instructions for 
connecting to the contest network from elsewhere on 
the Internet. Now that there is an attacker network,
the remaining members of the administrative team
make sure that the machines contestants have
brought with them are properly connected to the
attacker network.


For the first day, participants are not allowed to 
attack their opponents. A member of the administra- 
tive team watches the network carefully, using Snort 
[7] as an Intrusion Detection System to catch any 
teams who violate this policy. 


On the second day of the contest, the gloves 
come off, and the participants may begin to attack 
each other. For the remainder of the contest, the 
machines on the defender network will be the target of 
countless port scans and exploits. The second day of 
the contest also marks the beginning of the scoring 
period, which continues until the end of the contest. 


Scoring Mechanisms 


The scoring mechanism used in this contest is 
divided into two types: defensive and offensive. The 
defensive portion of the score is derived from the 
availability of the services offered by the team’s 
assigned machine, as tested by an automated system. 
Offensive scoring has proven very difficult to automate,
and is currently handled through a web-based report-
ing system discussed below.


In order for an automated service checking sys- 
tem to be effective, it must accurately simulate user 
interactions with the service. If the service verification 
is not thorough enough, a team could replace a com-
plex service with a simplified “stub” service. Fortu-
nately, the Nessus remote security scanner [8] pro-
vides a flexible plugin architecture suitable for the
needs of this contest. Nessus plugins can be written in 
either C, or the Nessus Attack Scripting Language 
(NASL) [9]. All contest-related service tests are writ- 
ten in NASL, and report success or failure using two 
of the built-in Nessus plugin return states. The assign-
ment of status is arbitrary, but must be consistent with
the Nessus output parser. 


Defensive scoring is divided into “rounds” that
last ten minutes. During each round, a Nessus client 
connects to the Nessus daemon running on the gateway 
and requests a test of the defender network using only 
the contest-specific tests. After the server performs the 
checks, the data is fed back to the client as XML. The 
Nessus output is then parsed using XPath expressions 
and the result of each test is recorded in the scoring 
database. Storing the test results in this manner not 
only allows multiple tests for a single service, but if a 
test proves to be unreliable on the contest network, it 
can be disqualified without losing any additional scor- 
ing data. SQL queries can then be used to determine 
which services on each machine passed all tests, and 
should be awarded points. Point values for each ser- 
vice are also stored in the database, so that a team’s 
points could be calculated using a single SQL query. 
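

The flow from XML results to a single scoring query can be sketched as follows. Everything concrete here is an assumption made for illustration: the XML layout is a simplified stand-in and not the actual Nessus report format, the table and column names are invented, and SQLite stands in for whatever database the contest used.

```python
import sqlite3
import xml.etree.ElementTree as ET

SCHEMA = """
CREATE TABLE services (host TEXT, service TEXT, team TEXT, points INTEGER);
CREATE TABLE results (round INTEGER, host TEXT, service TEXT, test TEXT, passed INTEGER);
CREATE TABLE disqualified (test TEXT);
"""

# A service scores in a round only if every non-disqualified test passed.
SCORE_SQL = """
SELECT s.team, SUM(s.points)
FROM services AS s
JOIN (SELECT round, host, service
      FROM results
      WHERE test NOT IN (SELECT test FROM disqualified)
      GROUP BY round, host, service
      HAVING MIN(passed) = 1) AS ok
  ON ok.host = s.host AND ok.service = s.service
GROUP BY s.team
"""


def record_round(db, round_no, xml_text):
    """Store one row per test result found in the (hypothetical) XML report."""
    root = ET.fromstring(xml_text)
    for result in root.findall("./result"):
        db.execute(
            "INSERT INTO results VALUES (?, ?, ?, ?, ?)",
            (round_no, result.get("host"), result.get("service"),
             result.get("test"), int(result.get("status") == "pass")),
        )


def defensive_scores(db):
    """Return {team: points} for teams with at least one passing service."""
    return dict(db.execute(SCORE_SQL).fetchall())
```

Because each test result is stored as its own row, adding a test name to the `disqualified` table changes the scores without touching any recorded data.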


Of course, the score of the game is not nearly as 
exciting unless it is displayed for all to see, so some of 
the more graphically gifted members of the adminis- 
trative team put together a scoreboard using Macrome- 
dia’s Flash [10]. The scoreboard retrieves a simple 
XML file containing various scoring-related quanti- 
ties, and displays them in interesting ways. For exam- 
ple, the percentage of running services is represented 
as a gauge, and the overall team score is displayed 
simply as a number. There are many other quantities 
that can be measured in this setting and these quanti- 
ties are presented only as examples. The scoreboard is 
a valuable addition to the contest, as it allows contes- 
tants and spectators alike to get a clear picture of the 
contest standings. 
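

A scoreboard feed of this kind might be produced as in the sketch below. The element and attribute names and the shape of the input dictionary are hypothetical; the paper does not specify the format of the XML file read by the Flash scoreboard.

```python
import xml.etree.ElementTree as ET


def scoreboard_xml(teams):
    """Build a small XML document from per-team scoring quantities.

    teams maps a team name to a dict with the (hypothetical) keys
    'score', 'services_up', and 'services_total'.
    """
    root = ET.Element("scoreboard")
    for name, stats in sorted(teams.items()):
        total = stats["services_total"]
        percent_up = round(100 * stats["services_up"] / total) if total else 0
        ET.SubElement(
            root, "team",
            name=name,
            score=str(stats["score"]),
            services_up_pct=str(percent_up),   # value shown on the gauge
        )
    return ET.tostring(root, encoding="unicode")
```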


For the remainder of the contest, the only task 
for the administrative team is to respond to feedback 
from participants. Teams may provide this feedback to 
the administrative team using a set of web tools that 
were created to help the contest run more smoothly. 


Web-Based Administrative Tools 


The CtF web tools were created to facilitate
communication between the contest admins and the
contestants. As the contest has become more complex, so
have the tools. The tools were originally written using
AxKit [11], but work is underway to remove that 
dependency. The registration tool is not protected by 
authentication, but all the remaining tools require a 
team username and password unless otherwise noted. 


The most fundamental of the web tools used in 
the administration of the contest is the registration 
form. The registration form is filled out by the teams 
before the contest, and includes information such as 
team name, OS of choice, and the name and contact 
information for each member of the team. This infor- 
mation is recorded in the scoring database for later use 
by scoring software and web tools. 


When a team compromises another machine on 
the defender network, they must fill out an offensive 
report. The offensive report requires that a team spec- 
ify the IP address of the machine they have compro- 
mised, and provide the administrators some measure 
of proof that they have gained elevated access on the 
machine. This can be done in a number of ways, 
including modifying web content or binding a shell to 
a specific port. Any proof of compromise must be ver- 
ifiable without local access to the machine, since the 
contest administrators do not maintain any level of 
access to contestant machines. The administrators then 
use a correspondence web tool to reply to the teams, 
indicating whether their proof was sufficient or not. 


In the event that a team’s assigned machine 
becomes unusable, the team may use the web tools to 
request that their machine be re-installed. Re-installa- 
tion is done at the convenience of the contest adminis- 
trators, but once rebuilt, the machine will come up 
unpatched and mis-configured on a hostile network. 
Because of this, re-install requests are rare, and may 
be removed from the contest in the future. 


If a team gains complete control of another 
machine on the defender network, they may request 
ownership of the machine, in order to gain more 
defensive points for services offered by that machine. 
The web tool requires that the team provide the special 
flag, stored on a CDROM, and readable only by the 
administrative user. When the contest administrators 
receive a change of ownership request, the ownership 
of the machine is updated in the scoring database, and 
the flag on CDROM is changed to prevent the previ- 
ous owners of the machine from re-claiming it. The 
replacement scoring system in development handles 
ownership at a service level, rather than a host level, 
so this system will be obsolete. 


Teams are also encouraged to turn each other in 
for violations of contest rules, since the contest admin- 
istrators cannot hope to catch all violations them- 
selves. If a team believes that they have found a rule 
violation, they fill out a web form detailing whom 
they believe to be violating the rules. The contest 
administrators will then investigate and respond to the 
team who reported the violation. 


If a team wishes to convey a message to the con- 
test administrators that does not fit into one of the cate- 
gories above, a general comment tool is also provided. 


The administrative side of the web tool suite 
presents the user with a list of reports that have not yet 
been handled. Resolving a report emails the contest 
administrator’s comments to the captain of the team 
who filed the report. These tools allow the contest 
administrators to efficiently handle feedback from 
contestants. 


This particular game has evolved to its current 
state over the last few years and is only an example of 
the many possible uses for this network. Several simi- 
lar games are in development, including a “King of 
the Hill” type game where contestants would attempt 
to gain root access on a machine, and attempt to main- 
tain that access as long as possible. Both the network 
setup and the game have evolved a great deal to reach 
their current status, but there is much more work still 
being done to improve them. 


Future Improvements 


Over the years, this contest has improved as 
members of the administrative team have found ways 
to automate and simplify various aspects. In the spirit 
of continuous improvement, there is still a great deal 
of work in progress to make the network easier for 
administrators to set up and the game more enjoyable 
for the participants. 


Network Improvements 


With the creation of the attacker network, the 
administrators found that the network was transferring 
significantly more data than it had before, due to 
downloading and web browsing by the participants. In 
the past, contestants connected to the contest from 
their homes, so the contest was not responsible for 
providing normal access to the Internet. However, 
with the advent of the attacker network, the contest 
network must provide enough bandwidth for teams to 
research service protocols, and investigate possible 
attack vectors. 


In order to achieve this, the routing policies 
described initially will be enforced by the Cisco Cata- 
lyst 3550, which performs many routing functions in 
hardware. Relocating this functionality will increase 
the throughput of the contest network, and free up the 
gateway for other tasks. 


One task that the gateway’s resources could be 
used for is a proxy server. Currently, the routing configu- 
ration of the gateway does not have the ability to restrict 
traffic to the HTTP and HTTPS protocols, merely to 
their respective ports. Using an application-level proxy 
would reduce the potential for misuse of the network by 
ensuring that only valid requests pass through. 


Game Improvements 


The most time consuming task involved in build- 
ing the contest is the bulk creation of machines for use 
on the defender network. Although identical machines 
can easily be constructed using disk imaging software, 
a network full of identical machines is not very inter- 
esting. Research is currently under way to investigate 
the potential use of other configuration management 
systems, such as Fully Automated Installation [12], or 
Radmind [13] to create basic installations that are sim- 
ilar, but not identical. The use of such systems would 
drastically reduce the amount of work required for 
contest setup, and may allow the contest to be run 
more frequently. 


Automation of the defensive scoring for Capture 
the Flag saved the administrative team quite a bit of 
time and effort, while automating the offensive scor- 
ing could save even more. The automated scoring sys- 
tem in development combines the defensive and offen- 
sive scoring into a single system based on dynamic 
service flags. 


Each service must make the flag information 
available to all clients, and the flag must be located in 
a pre-determined location for each service. For exam- 
ple, the web server on host 10.10.0.3 would make its 
flag available at http://10.10.0.3/flag.txt. Other services 
must make their flags available in a similar manner. 


At the start of the contest, each team is issued a 
special initialization flag. When a team takes control 
of a service, they replace the existing flag with their 
own initialization flag. In the next round, the scoring 
system will recognize that the ownership of that ser- 
vice has changed. 


Each round, the scoring system connects to a 
scoring daemon on the target contestant machine, and 
retrieves the flag for each service from the filesystem. 
The scoring system then connects to the service itself 
and performs a series of validation tests, which 
includes retrieving the flag for the service. If the ser- 
vice flag from the filesystem does not match the flag 
obtained through the service, the contestant has 
attempted to trick the scoring system, and should be 
penalized. If the two flags match, they are compared 
to the expected value stored in the scoring system’s 
database. If the flags match, the service is still under 
the control of the same team, and should receive 
points. If the flag retrieved by the scoring system 
matches the initialization key of another team, the 
scoring system updates the ownership of that service 
to reflect the compromise, but no points are awarded. 
This was done to deter teams from simply replacing 
their flags with initial flags each round. 


If the flag does not match the expected value or 
any team’s initialization key, the flag has been tam- 
pered with, and the contest admins should be alerted. 
As of this writing, the above system is under active 
development, although not yet completed. 
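
The per-round decision described above can be sketched as follows. This is only an illustration of the logic as stated in the text, not the scoring system under development; the function name, the outcome labels, and the mapping of initialization flags to team names are all hypothetical.

```python
from enum import Enum


class Outcome(Enum):
    POINTS = "points"            # flag unchanged, current owner earns points
    PENALTY = "penalty"          # filesystem and service flags disagree
    OWNERSHIP_CHANGE = "owner"   # another team's initialization flag was found
    ALERT = "alert"              # unknown flag, contest admins should be told


def score_service(fs_flag, service_flag, expected_flag, init_flags):
    """Classify one service for one scoring round.

    fs_flag       -- flag read from the filesystem by the scoring daemon
    service_flag  -- flag retrieved through the service itself
    expected_flag -- value stored in the scoring database
    init_flags    -- hypothetical mapping {initialization flag: team name}
    Returns (Outcome, new_owner_or_None).
    """
    if fs_flag != service_flag:
        # The contestant tried to show the scorer two different flags.
        return Outcome.PENALTY, None
    if fs_flag == expected_flag:
        return Outcome.POINTS, None
    if fs_flag in init_flags:
        # Ownership changes hands, but no points are awarded this round.
        return Outcome.OWNERSHIP_CHANGE, init_flags[fs_flag]
    return Outcome.ALERT, None
```

The checks are applied in the same order as in the description: consistency between the two retrieved flags first, then the expected value, then the initialization flags.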


Several contest additions have been proposed to 
add realism. The best way to add a realistic 
angle to the contest is to present each contestant team 
with content for each service they run. Teams would be 
required to serve this content in order to earn points. 
Service tests could then be used to verify that the pro- 
vided content was still in place. The number of services 
could also be narrowed, allowing contestants to focus 
their research efforts on a smaller number of protocols 
and allowing administrators to write more complex 
functionality tests for those services. Bandwidth usage 
and performance are important factors as well, and may 
be incorporated into the contest in the future. 


The bandwidth used by each contestant machine 
could be tracked, and the teams could be “charged” 
some number of points, based on their bandwidth 
usage in each scoring round. The tracking of band- 
width usage could easily be performed by the switch, 
but this information must still be fed back into the 
scoring system. A similar “charge” could be applied 
for slow response to scoring checks. For example, if 
the response time for a particular host is significantly 
longer than the response time of the other hosts, that host 
should be penalized for its poor performance. Imple- 
menting these penalties has not yet been attempted. 
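
Since these penalties have not been implemented, the following is only a sketch of how they might be computed. The charging rate, the use of the median as the "typical" response time, and the factor of three standing in for "significantly longer" are all assumptions made for illustration.

```python
from statistics import median


def bandwidth_charge(bytes_used, points_per_megabyte=0.01):
    """Points charged for traffic in one round (the rate is an assumed value)."""
    return bytes_used / 1_000_000 * points_per_megabyte


def slow_hosts(response_times, factor=3.0):
    """Return hosts whose response time exceeds `factor` times the median.

    response_times -- {host: seconds} measured during one scoring round
    factor         -- assumed meaning of "significantly longer"
    """
    if not response_times:
        return set()
    typical = median(response_times.values())
    return {host for host, t in response_times.items() if t > factor * typical}
```

The median is used here because a single very slow host barely moves it, so one outlier cannot hide itself by dragging the baseline upward.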


Although this contest has evolved for several 
years, there are still many more performance enhance- 
ments and features that are in development. Sugges- 
tions and feature requests are welcome. 


Availability 


The code and configuration files used in the con- 
test will be made available at 
http://www.nerdcircus.org/ctftools/. 


Conclusions 


There are many reasons for an organization to 
run a competition like this, whether for educating new 
sysadmins, or just to keep current sysadmins sharp. 
Security training exercises like the one described here 
can be a great benefit to any organization. Such exer- 
cises can be used to raise awareness of common secu- 
rity problems and their solutions. In an academic envi- 
ronment, these exercises can be used to give interested 
students a safe environment in which they may 
explore many aspects of security, without endangering 
the security of others. 


Corporate environments stand to gain just as 
much. Using the techniques described here would 
allow organizations to better evaluate the technical 
proficiency of the individuals being considered for 
security-related positions. In addition to entertainment, 
these training exercises could be used to keep existing 
members of security teams at their best. 


The network described here can be built using 
hardware that many organizations may already have 
lying around in storage rooms. 


Although these contests are a valuable learning 
experience, and have numerous other uses, the main rea- 
son that we have continued running them is simple: fun. 


How to manage security in an environment with no firewalls, 
with all users having root, and no direct physical control of any system 


ABSTRACT 


PlanetLab is a globally distributed network of hosts designed to support the deployment and 
evaluation of planetary scale applications. Support for planetary applications development poses 
several security challenges to the team maintaining PlanetLab. The planetary nature of PlanetLab 
mandates nodes distributed across the globe, far from the physical control of the team. The 
application development requirements force every user to have access to the equivalent of root on 
each machine, and use of firewalls is discouraged. If an account is compromised, PlanetLab 
administrators need a way to track the actions of users on the nodes. If an entire node is 
compromised, then the administrators need a way to regain control despite the lack of physical 
access. Encryption was built into PlanetLab to ensure confidentiality and integrity of system 
downloads. A special reset packet, combined with keeping a boot CD in the machine, enables 
PlanetLab system administrators to remotely regain control of machines if they are compromised and 
return the nodes to a safe known state. The Linux VServer implementation is used to provide 
root access to PlanetLab users for development purposes while isolating users from each other. A 
network abstraction layer provides accounting of traffic and allows safe access to raw sockets. These 
mechanisms have proven very useful in managing PlanetLab. After a compromise of large numbers 
of PlanetLab hosts, control of the PlanetLab network was regained in 10 minutes. The compromise 
spawned a review of PlanetLab security, which pointed out a number of flaws. The need for a central 
site for maintaining PlanetLab was cited as a key weakness. Future work includes distributing the 
functions of PlanetLab’s central administrative database and improving integrity checks. 


Introduction 


The PlanetLab distributed system testbed [1] has 
a number of unique attributes that make security 
administration difficult. PlanetLab is a globally dis- 
tributed network of hosts designed to support the 
deployment and evaluation of planetary scale applica- 
tions by distributed systems researchers all over the 
world. To support their application development 
efforts, PlanetLab users need root access. PlanetLab 
systems are used for a variety of network experiments 
and require unfettered access to and from the Internet. 
Use of firewalls between PlanetLab hosts and the 
Internet is strongly discouraged. Moreover, PlanetLab 
systems are at sites distributed all over the planet and 
are not under the direct physical control of any of the 
PlanetLab administrators. At the same time, PlanetLab 
nodes must be available and usable by the researchers 
while site and PlanetLab administrators need to be 
able to respond to security complaints. 


This paper describes the security challenges 
faced by the PlanetLab administration team. It reviews 
the issues that needed attention, how we dealt with 
them, and our experiences with the implementation. 
We first describe the PlanetLab environment and the 
system and network requirements of the PlanetLab 
community. We then describe our design and the 
implementation of that design. After that, we review 
our experiences with this design, followed by sections 
on related and future work. 


Environment and Problem Description 


In this section, we describe the PlanetLab envi- 
ronment and talk about the key security challenges 
facing PlanetLab administrators. 


PlanetLab Environment 


PlanetLab [1] is a distributed-systems testbed, 
allowing the research, development, and prototyping 
of new applications and network services. The testbed 
comprises 433 nodes at 194 sites.¹ PlanetLab is truly 
“planetary scale” as it is geographically spread across 
five continents and topologically spread across the 
Internet, Internet2, and other networks. Because of 
this geographic and network diversity, the test bed 
provides researchers with a very “real-world” set of 
opportunities and challenges; specifically it allows the 
deployment of, experimentation with, and test/mea- 
surement of services in a non-simulated network. 


¹As of the time of this writing; growth continues. 


PlanetLab nodes are computers (PCs that meet 
an evolving set of minimum configuration requirements) 
which are hosted by a variety of universities, corpora- 
tions, and other organizations. The nodes are dedi- 
cated to running PlanetLab. They run a special version of 
Linux (detailed later in the paper) and are adminis- 
tered remotely (including patching) by a set of admin- 
istrators focused on PlanetLab. The set of initial test bed 
machines was seeded by a grant from Intel Corpora- 
tion, but now new member organizations must con- 
tribute nodes as a condition of joining the test bed. 
Hosting organizations provide electrical power, physi- 
cal space, and network connectivity to their nodes. 
PlanetLab administrators have physical access to 
almost none of the nodes, and the turnaround on phys- 
ical work requests on the machines is on the order of 
days. Administrative and system burden on the host- 
ing organization is deliberately limited (we don’t 
require remote consoles, for example), in order to 
make joining the testbed as painless as possible. 


Security Challenges 


Because of the nature of the research done on 
PlanetLab — requiring unfettered access to the network 
and frequently resulting in non-standard traffic pat- 
terns — the nodes are generally positioned outside of 
the hosting organization’s firewall. A consequence of 
this is that nodes are typically not protected by any 
kind of filtering of inbound traffic, and the lack of out- 
bound filtering permits all kinds of traffic, some of 
which will be interpreted as hostile, to be sent out. 


PlanetLab users perform distributed systems 
research. Accomplishing this frequently requires great 
flexibility on the part of the system — for example, 
requiring root access to perform certain functions, or 
wishing to use the system in odd ways (or replace part 
of the system with their own code). At the same time, 
the node must remain stable enough for use. More- 
over, researchers don’t want other experiments affect- 
ing their experiment (as well as the converse). 


PlanetLab nodes are administratively very com- 
plex — they consist of machines in different networks 
and administrative domains, providing access to 
researchers who are at arbitrary locations in other 
administrative domains and who do odd, non-standard 
things. Keeping control of the nodes is a difficult 
problem. That control rests with a centralized group of 
PlanetLab system administrators who develop and 
maintain the base operating system (including patches 
for security and for PlanetLab functionality) and the 
associated management utilities. 


1The PlanetLab consortium is hosted by Princeton Univer- 
sity, and their staff serve as centralized PlanetLab system ad- 
ministrators. 


Host organizations also frequently have require- 
ments that they be able to control nodes on their net- 
work. One concern that host organizations have is that 
PlanetLab nodes would be used for nefarious purposes. 
Sites would need ways to audit PlanetLab usage to help 
them deal with any possible complaints received 
about PlanetLab node behavior. Despite the trust that 
sites have shown by hosting PlanetLab nodes, we antic- 
ipated that at some point PlanetLab nodes could be 
compromised. We needed a mechanism to 
remotely regain control of hosts when this happened. 
Nodes would need to be brought to a safe known state 
for forensics and for removal of vulnerabilities. Also, 
the ability to remotely power cycle a node would not be 
enough. While that functionality is extremely useful for 
remotely managing machines when they get hung, it does 
not implicitly put the PlanetLab node into a state where 
it can be debugged remotely. 


Summary 


The key security problems we faced can be summarized 
as follows: 

• Create a full and rich development environment 
where users have tremendous flexibility while 
being isolated from each other and the native 
OS environment. 

• Make it possible, even comfortable, for sites to 
host PlanetLab nodes despite possible com- 
plaints about node behavior and the fact that the 
local site does not fully control the node. 

• Be able to regain control of PlanetLab nodes 
even if they are compromised. 


Related Work 


There are a number of large distributed/network 
testbeds that deal with similar issues. Emulab [2] is a 
network testbed that has many of the same concerns as 
PlanetLab. Emulab uses the FreeBSD jail [3] to isolate 
experiments in a type of virtual machine. PlanetLab 
differs from Emulab in that PlanetLab emphasizes the 
development of services and APIs and also aims to be 
a deployment platform for services. Also, while Emu- 
lab nodes talk primarily to each other, PlanetLab 
nodes are encouraged to and often do communicate 
with non-PlanetLab nodes, making the need for an 
audit trail of PlanetLab node activity more critical. 


As we describe later, PlanetLab uses a virtualiza- 
tion technology to isolate users from each other while 
giving them a very flexible environment, not unlike 
Emulab’s use of a chroot jail. Related work in the vir- 
tualization area includes Xen [4] and Denali [5]. A number 
of PlanetLab nodes are even running on top of Xen. 


Other work has been done to manage an environ- 
ment where users have tremendous flexibility and 
need the equivalent of root. Leon, et al., [6] discuss 
how they manage an environment where all users have 
root. Like the environment described, PlanetLab takes 
advantage of having sophisticated users who are will- 
ing and able to manage their own environment. 
PlanetLab is not intended as a desktop environment 
where users perform activities such as receiving mail. 


The Grid has been compared to PlanetLab in [7]. 
PlanetLab differs in that it is more network-centric, and 
less compute-centric, than the Grid. Many PlanetLab appli- 
cations, such as network measurement [8] and content 
distribution [9, 10], rely on geographic and network 
dispersal to be effective, while rarely being CPU 
intensive. Dispersal of Grid resources tends to be more 
accidental than intentional. Also, Grid resources are 
shared (compared to dedicated PlanetLab nodes), 
typically more heterogeneous than PlanetLab nodes, 
and not centrally managed like PlanetLab. The net- 
work-centric application mix of PlanetLab, particu- 
larly with network measurement and content distribu- 
tion, makes the audit trail requirement more pressing. 


Security Design and Implementation 


There are many points of control that must be 
managed to provide top-to-bottom control of the sys- 
tem. The first is the users, who must use the system’s 
resources in a reasonable way, so as not to provoke sites 
into removing the hosts. Next, resource utilization, 
such as network, CPU, and disk space, for each user 
must operate within certain bounds. The platform 
must be remotely controlled. The booted operating 
system and base execution environment must be 
installed securely and controlled remotely. This sec- 
tion describes each of these pieces and how they are 
controlled and secured. 


AUP 


The PlanetLab users operate within the limits of 
a published Acceptable Use Policy (AUP). All Planet- 
Lab users get access to PlanetLab by first creating an 
account at PlanetLab Central. One step in this signup 
is the user’s acceptance of the AUP. Since PlanetLab 
is a testbed for experiments in new Internet technolo- 
gies, it is difficult to enumerate specific limits within 
which users must operate. Of course, malicious activ- 
ity, attempts to subvert the PlanetLab security and 
authorization system, illegal activities, excessive node 
use and activities that exceed the usual limits of net- 
work propriety are called out, but the general rule is 
“do no harm.” The AUP instructs PlanetLab users to 
ask what activities would cause network and resource 
alerts in their own site and then consider the same sort 
of limits on the remainder of the PlanetLab nodes. 


User Isolation 


To provide a rich development environment to 
users yet provide user isolation, we modified the base 
OS of PlanetLab nodes [11]. PlanetLab administrators 
use a lightweight virtual machine abstraction provided 
by the Linux VServer [12] implementation. Each 
research group getting access to a node receives a 
chrooted virtual Linux machine, which we will call a 
vserver. The user API effectively becomes Linux. 
VServers virtualize machines at the system call level, 
above the kernel. Virtualizing at this level allows us to 
scale to 1000 virtual machines at the cost of weaker 
isolation, something not possible with other VMM 
implementations like VMware [13] or Xen [4]. To 
use the PlanetLab network, researchers get “slices” of 
the infrastructure. Slices are collections of accounts on 
some set of nodes across the network. These accounts 
on a node are isolated within vservers, with the excep- 
tion of some administrative slices. 


Network isolation is achieved through a “safe raw 
sockets” implementation [14], part of the SILK pack- 
age, which is derived from Scout [15]. This implemen- 
tation provides controlled access to the network stack 
by what appears to be raw sockets without granting 
root privilege. It also isolates traffic, preventing indi- 
vidual virtual machines from snooping on each other’s 
traffic. In addition to isolating network traffic, SILK 
provides CPU guarantees and enforces usage policy. 
The Linux Traffic control facility [12] is used to man- 
age the bandwidth utilization and implement bandwidth 
policies. We allow site administrators to set the amount 
of bandwidth that each PlanetLab node can use. 


SILK also provides network traffic auditing capa- 
bility. SILK tags each packet with the ID of the 
VServer that sent it and provides an administrative port 
for snooping outgoing traffic. We also created black- 
lists that would prevent a PlanetLab node from con- 
tacting some set of IP addresses. PlanetLab administra- 
tors install these blacklists, and local site administrators 
can request that nodes or networks be placed on them. 
Care needs to be taken in the installation of blacklists 
to prevent nodes from being made totally inaccessible. 
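
As a rough illustration of the blacklist check and of the caution about making a node inaccessible, consider the sketch below. It is not the SILK mechanism; the function names and the idea of a list of addresses that must stay reachable are assumptions introduced here, and only Python's standard ipaddress module is used.

```python
import ipaddress


def is_blacklisted(destination, blacklist):
    """Return True if `destination` falls inside any blacklisted entry.

    blacklist -- iterable of strings, each a single address or a CIDR network
    """
    addr = ipaddress.ip_address(destination)
    return any(addr in ipaddress.ip_network(entry, strict=False)
               for entry in blacklist)


def check_blacklist(blacklist, must_reach):
    """Raise ValueError if the blacklist would cut off a required address.

    must_reach -- hypothetical list of addresses the node must still be able
                  to contact (for example, its management servers)
    """
    for required in must_reach:
        if is_blacklisted(required, blacklist):
            raise ValueError(f"blacklist would block required address {required}")
```

Running a check like check_blacklist before installing a new list is one way to catch an overly broad entry, such as a network that happens to contain the servers the node depends on.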


Reporting 


PlanetLab’s geographic distribution makes it 
ideal for mapping the Internet. It seems that many 
researchers first build a “Hello world” application 
that pings other PlanetLab and non-PlanetLab nodes to 
discover timing and connectivity information. 
Repeated pings, IP address space scanning and port 
scanning are just the activities that set off alarms in Snort [16] 
and other network monitoring tools. Even some 
“well designed” probing applications (i.e., with built- 
in flow restriction to avoid complaints) have set off 
alarms. This implies that some sites have very tight 
restrictions on probing and mapping activities. 


To handle an inappropriate traffic incident, we 
need to map the reported activity from the network 
traffic to the experimenter. A traffic report usually 
contains a time and a source and destination IP 
address. Additionally, traffic reports relate to an inci- 
dent in the past. We found that most conventional traf- 
fic monitoring tools are for watching current traffic 
and not recording and querying past traffic. 


These problems (mapping, delay and distribu- 
tion) led to the development of tools which have each 
node collecting information on its own network traffic 
(in and out), saving that information and eventually 
reporting that information to a central repository. 


As mentioned above, the kernel’s network stack 
was enhanced with SILK to return information on 
the IP addresses with which the slivers were communi- 
cating. An administrative application named “netflow” 
analyzes this information every five minutes and cal- 
culates the “flows”: the connections from a source to 
a destination by some slice. This information is saved 
to a file. These files are kept on the node and are even- 
tually copied to PlanetLab Central where they are 
available for analysis if problem reports arrive. 
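
The flow calculation can be pictured with the sketch below. It is not the PlanetLab netflow tool; the packet record layout and the choice to sum bytes per flow are assumptions made for illustration, with only the five-minute interval taken from the text.

```python
from collections import defaultdict

WINDOW_SECONDS = 5 * 60  # the text says flows are calculated every five minutes


def aggregate_flows(packets):
    """Group packet records into flows per five-minute window.

    packets -- iterable of (timestamp, slice_name, src_ip, dst_ip, n_bytes);
               this record layout is an assumption, not the SILK format.
    Returns {(window_start, slice_name, src_ip, dst_ip): total_bytes}.
    """
    flows = defaultdict(int)
    for timestamp, slice_name, src_ip, dst_ip, n_bytes in packets:
        # Round the timestamp down to the start of its five-minute window.
        window_start = int(timestamp) - int(timestamp) % WINDOW_SECONDS
        flows[(window_start, slice_name, src_ip, dst_ip)] += n_bytes
    return dict(flows)
```

Keying each flow by slice as well as by address pair is what later allows a complaint about a destination address to be traced to a specific experiment.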


This flow information is then made available on 
each node and from PlanetLab Central. Each Planet- 
Lab node runs a web server on the standard web port 
(80) that gives information about PlanetLab, about the 
node and allows browsing through the flow informa- 
tion. This allows local administrators of sites hosting 
PlanetLab nodes to respond to traffic and security 
alerts. Through the web page, an administrator can 
search for the reported destination IP addresses and 
trace this to the email address of the researcher. If a 
remote (not at the PlanetLab site) network administra- 
tor receives reports about the traffic from a PlanetLab 
node, that administrator can contact the experimenter 
directly. In this way, the reporting facility removes 
PlanetLab administrators from the chain of contacts 
regarding a perceived incident, reducing the time to 
respond to security complaints. Given the exposure of 
this web server (no passwords or accounts are needed 
to access it), web pages are implemented in simple 
HTML (no Java, JavaScript, or PHP) with no user text 
input required for selecting and viewing traffic patterns. 
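
The lookup an administrator performs, from a reported time and destination address to a researcher's email, might look like the following sketch. The flow record layout, the contact table, and the function name are hypothetical; the actual pages are static HTML and are not reproduced here.

```python
def find_responsible(flows, contacts, dst_ip, when, window=300):
    """Return sorted (slice, email) pairs for traffic to dst_ip around `when`.

    flows    -- iterable of (window_start, slice_name, src_ip, dst_ip) records
    contacts -- hypothetical mapping {slice_name: researcher email}
    when     -- time from the traffic report, in the same units as window_start
    window   -- length of one flow window in seconds (five minutes assumed)
    """
    hits = {
        slice_name
        for window_start, slice_name, _src, flow_dst in flows
        if flow_dst == dst_ip and window_start <= when < window_start + window
    }
    return sorted((s, contacts.get(s, "unknown")) for s in hits)
```

Because traffic reports describe past incidents, the lookup works on saved flow records rather than on live traffic.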


The requirement on the PlanetLab infrastructure 
for providing this service is maintaining a mapping 
between resource use and service operators (the researchers). This 
means a verifiable system of user identities and a 
monitoring system that records the use of resources 
and who is using them. PlanetLab thus has an extensive 
system for monitoring resource use, and this monitoring 
system is tied into a system that authenticates users 
and provides a path back to the email of a 
responsible person for any resource use. 


Preventing and Dealing with Compromises 


We knew that there was a substantial chance that 
our exposed network of nodes could be compromised. 
To deal with that possibility, we configured our 
machines to boot only from a CD in the machine. Once 
the machine boots, it downloads, via an SSL-secured 
connection, a GPG-signed script to execute as the next 
phase of the boot process. These scripts are used for 
remote re-installation, normal booting, and placing the 
node into a “debug” mode in which the network stops 
all traffic except ssh connections. Since scripts for nor- 
mal reboot are downloaded from a central location, we 
can upgrade the kernel versions used on the hosts with- 
out having to update the CD. During debug mode, 
should the connection to the PlanetLab central website 
become unavailable for any reason, the node will reboot 
and retry the connection at 15 minute intervals. With 
the debug mode, we can bring nodes into a safe known 
state while preserving disk information for forensics. 


The Linux kernel on PlanetLab nodes has been 
modified to reboot when it receives an ICMP trigger 
packet with a unique 128 bit payload which is gener- 
ated for each machine and is re-generated each time a 
machine reboots. We chose 128 bits to make exhaus- 
tive search attacks very difficult against a single node. 
Since each node has a unique packet for reboot, a replay 
attack is ineffective against the entire PlanetLab 
network. At worst, a replay attack will only cause a sin- 
gle node to be rebooted. If desired, the machine can be 
forced to come up into a special debug mode, which, 
as described above, limits access while allowing for 
forensics. While effective, this software reboot mecha- 
nism suffers from the problem that the machine must 
have a working network stack and connectivity to the 
Internet in order to ensure a reboot. With the widescale 
filtering of ICMP traffic following the SQL Slammer 
worm, we now recommend that PlanetLab sites install 
remote power switches on their nodes. 
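
The per-node reboot payload can be illustrated in user space with the sketch below. The real check lives in the modified kernel and is not shown in this paper; the function names are hypothetical, and the constant-time comparison is a choice made here rather than something the text describes.

```python
import hmac
import secrets

TOKEN_BYTES = 16  # 128 bits, as in the text


def new_reboot_token():
    """Generate a fresh random 128-bit payload; call again after every reboot."""
    return secrets.token_bytes(TOKEN_BYTES)


def should_reboot(payload, current_token):
    """Return True only if the received payload matches this node's token."""
    return (len(payload) == TOKEN_BYTES
            and hmac.compare_digest(payload, current_token))
```

With 128 bits there are 2^128 (about 3.4 x 10^38) possible payloads, which is why exhaustive search against one node is impractical. Because the token is regenerated at each reboot, a captured packet stops matching once it has been used.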


Experiences 


Many of the security features implemented in PlanetLab have 
proven very useful. Our reporting mechanisms have 
defused many incidents after a network experiment trig- 
gered an overly sensitive Intrusion Detection System 
(IDS). Remote control and access made recovery from a 
system compromise quick and effective. This section 
will describe some of the incidents and successes of the 
PlanetLab security and control mechanisms. 


User Behavior and AUP 


Since the current direct users of PlanetLab are 
researchers who generally understand operation on the 
Internet, there have not been many incidents that 
required enforcement of the Acceptable Use Policy 
(“AUP”). We have not yet had to revoke access 
from any user, which would be the ultimate penalty associ- 
ated with AUP violation. 


Problems with PlanetLab user behavior have 
been studied [17] and fall into two categories: pro- 
gram failure and accidental network traffic alerts. 
Building distributed, decentralized applications is hard 
and, of course, when you have lots of projects building 
them, there will be bugs. PlanetLab Central will 
receive reports or will notice excessive node resource 
use (e.g., no file descriptors) or excessive network 
traffic (e.g., too many external computers accessed or 
excess volume) and PlanetLab Central sends email to 
the researcher. In all cases, the researchers have 
responded to the situation. 


Measuring the Internet generates lots of probes 
and pings. A simple mapping experiment, generating a 
small amount of data and performing a straightforward 
measurement set off alarms at many locations. In this 
case, and in others, a measurement experiment has the 
same network traffic profile as a worm looking for 
hosts to infect (probing port 80 is a feature of 
CodeRed/NIMDA). It is against the PlanetLab 
Acceptable Use Policy to generate “disruptive” net- 
work traffic, but it’s sometimes hard to know what type 
of traffic would be considered disruptive. 


There have been many incidents of “attack” or 
“worm” reports that were traced back to a measuring 
or topology experiment. The resource monitoring sys- 
tem allows forwarding of the reports to the researchers 
and, in all cases, the researchers responded appropri- 
ately to the situation. 


Monitoring 


We put a lot of energy into providing ways for 
network and security administrators to determine for 
themselves which researcher generated the traffic and how 
to contact that researcher. While these facilities did reduce the 
workload from dealing with security complaints about 
PlanetLab node behavior, we continue to have problems
from overzealous intrusion detection systems. 
Some organizations set up intrusion detection systems 
(IDS) to trigger on things as relatively innocuous as a 
traceroute. One organization went as far as to threaten 
lawsuits if the behavior persisted, and this tactic proved 
successful in getting a number of PlanetLab hosts 
pulled off networks. Sometimes complaints were justified,
as some researchers’ experiments generated what 
would have to be interpreted as an attack: large numbers
of connections attempted to a range of IP 
addresses in a domain. In any case, we anticipate that 
poor experiment behavior and overly sensitive IDS 
will continue to cause problems. 


Compromise and Recovery 


We had anticipated that at some point, PlanetLab 
nodes would be compromised, and we did have an 
incident where large numbers of PlanetLab nodes 
were compromised. The early implementation of Plan- 
etLab had accounts that were not virtualized — they 
had access to the native operating system. An SSH 
key to a non-virtualized account was compromised, 
and that key was used to log into a number of nodes. 
Since the account was not isolated within a VServer, 
the attacker used his access to the native operating 
system to obtain root. When we received notice that a 
number of PlanetLab hosts had been rooted, we used 
the reboot feature of PlanetLab nodes to force all of 
PlanetLab into a known safe state in 10 minutes. 


We took a number of actions in response to the 
compromise. Forensic analysis, enabled by debug 
mode, determined that the nodes were rooted using a 
vulnerability that we had plans to patch. We had just 
begun to roll out a version that was not vulnerable to 
the exploit when we were attacked. We also elimi- 
nated general purpose slices that were not isolated 
with VServer accounts. At the same time, we made all 
user slices dynamic and eliminated static VServer 
slices. Slices would not be assigned by default to all 
nodes, which would limit the access to nodes if a slice 
private key were compromised. In addition, slices 
would have a finite lifetime. They would not last 
indefinitely, and would need to be renewed. This 
brings us closer to least privilege for slices: 
slices would only be instantiated on nodes that they 
needed and only for as long as they were needed. 


A Security Review 


We had fixed the more obvious problems, but 
what about problems we had not yet anticipated? 
Another response to the incident was to have more 
eyes looking at the problem, so we conducted a review 
of PlanetLab security. We wrote a summary of Planet- 
Lab’s architecture and implementation and had it 
reviewed by a variety of security researchers and prac- 
titioners and PlanetLab users. This review proved 
quite valuable, finding a variety of vulnerabilities and 
areas of improvement at both the implementation and 
architecture levels. 


One key problem found was the dependency on a 
single instance of PlanetLab Central for key PlanetLab
operations, such as slice creation and deletion 
and software updates. A compromise there would lead 
to a compromise of all PlanetLab nodes. A DoS 
attack on PlanetLab Central, while not rendering PlanetLab
unavailable, would make many key PlanetLab 
functions unusable. 


Another problem regarded resource manage- 
ment. We need better ways to monitor and manage 
resources such as slices, CPU, disk, and bandwidth 
utilization. A runaway process could render nodes 
unusable and take up so many resources that it would 
be nearly impossible to log in and fix the problem. In 
addition, much of resource management and the security
associated with it is hard to use. Slices can only be 
created by the principal investigator (PI) at a site. As the 
site PI is usually a busy professor, there is a tendency
for the PI to share his password, and we have 
evidence of this. Also, the dynamic slice mechanism 
provides no warning when slices will expire and all of 
a user’s work will disappear. As a result, there is a 
perverse incentive to create slices that live as long as 
possible. Some PlanetLab users had a contest to see 
who could create the longest-lived slice. Many of the 
reviewers mentioned that security that is hard to use 
will usually be worked around, as demonstrated. 


Related to the resource management issues is the 
need for better intrusion detection and prevention. 
While we have worked to improve the isolation of slices 
from each other and the real operating system, if a 
PlanetLab slice is compromised, the attacker has a large 
amount of resources available to him. We need ways to 
detect resource misuse and intrusion. Also, we need bet- 
ter ways to authenticate and authorize users. Relying on 
a single database of information run by a single organi- 
zation to authenticate and authorize users is not likely to 
scale or be secure. As the number of users and organizations
using PlanetLab increases, it is unlikely that PlanetLab
administrators could revoke access when those 
users leave an organization. Instead, a federated 
scheme, where access is granted to some institutions 
and those individual institutions manage who has valid 
access, is more likely to be successful in the long term. 
PlanetLab provides no easy way for slices to 
authenticate and authorize each other. As a result, some 
users have poor practices such as leaving SSH private 
keys on nodes. One reviewer pointed out that the use of any 
reusable password is vulnerable to attack. The 
attacker who compromised PlanetLab set up sniffers 
that listened not to network interfaces but to TTY ports. 
In this case, SSH does not help, as data streams are 
already decrypted where the attacker is listening, so even if 
users did put pass phrases on their SSH private 
keys, those pass phrases could be compromised. 


Future Work 


A large focus of future work is on the dependencies
on a single centralized facility for much of PlanetLab’s
operations. Key dependencies on PlanetLab 
Central are being analyzed, and a more formalized threat 
assessment matrix will be created. Creation of a red 
team for more formally and thoroughly analyzing security
weaknesses is also being considered. Integrity 
checkers such as chkrootkit [18] and rkdet [19] and 
other IDS-like features are being evaluated and tested. 
Longer term, the architecture for PlanetLab management 
is being studied to make it more secure and scalable. 


Conclusion 


PlanetLab’s security mechanisms have worked 
relatively well so far. The VServer mechanism effec- 
tively gives PlanetLab users a whole virtual machine 
to use and configure while isolating them from each 
other and the native operating system. The PlanetLab 
user account system gives network and security 
administrators a way to determine the source of problematic
traffic. PlanetLab has hosted hundreds 
of projects and researchers, and a major compromise 
was dealt with swiftly and effectively using PlanetLab’s
reboot mechanisms. A review of PlanetLab’s 
architecture and implementation has yielded a number 
of areas of improvement, such as the vulnerability of 
having a single point of control, the need for better 
resource management, and the need for improvements 
in authentication and authorization. 


Secure Automation: Achieving Least 
Privilege with SSH, Sudo and Setuid 


Robert A. Napier — Cisco Systems 


ABSTRACT 


Automation tools commonly require some level of escalated privilege in order to perform 
their functions, often including escalated privileges on remote machines. To achieve this, 
developers may choose to provide their tools with wide-ranging privileges on many machines 
rather than providing just the privileges required. For example, tools may be made setuid root, 
granting them full root privileges for their entire run. Administrators may also be tempted to create 
unrestricted, null-password, root-access SSH keys for their tools, creating trust relationships that 
can be abused by attackers. Most of all, with the complexity of today’s environments, it becomes 
harder for administrators to understand the far-reaching security implications of the privileges they 
grant their tools. 


In this paper we will discuss the principle of least privilege and its importance to the 
overall security of an environment. We will cover simple attacks against SSH, sudo and setuid 
and how to reduce the need for root-setuid using other techniques such as non-root setuid, 
setgid scripts and directories, sudo and sticky bits. We will demonstrate how to properly limit 
sudo access both for administrators and tools. Finally we will introduce several SSH techniques 
to greatly limit the risk of abuse including non-root keys, command keys and other key restrictions. 


Introduction 


Since its introduction in 1995 by Tatu Ylonen, 
SSH has quickly spread as a secure way to login and 
run commands on remote hosts. Replacing the previ- 
ous r-commands (rsh, rexec, rlogin), SSH provides 
much needed encryption and strong authentication 
features. Relying on public/private key techniques, 
SSH is very resistant to man-in-the-middle, IP spoof- 
ing and traffic sniffing attacks, all of which were sig- 
nificant problems with the r-commands. SSH was ini- 
tially released under a free license, but has since split 
into commercial¹ and free versions. In this paper we 
will focus on the most popular free version, OpenSSH. 


Sudo was developed in 1980 to allow users to 
execute commands as root without using the root pass- 
word. Today it provides per-host and per-command 
access control features and powerful logging facilities 
to track what is done by whom. 


Setuid (also called “suid” or “Set UID”) allows 
a UNIX program to run as a particular user. If the exe- 
cutable is owned by root for example, the program 
will run as the root user, giving it privileges that may 
be needed for its function. The passwd password- 
changing program is a good example of this, since it 
requires root privileges to write to /etc/shadow which 
holds user passwords. Setgid provides the same functionality
for UNIX groups, giving the program access 
to files writable only by a particular group. For example,
in FreeBSD programs that read system memory 
are setgid to a special kmem group. 


¹The commercial version of SSH is owned by SSH Communications
Security. The parts of this paper which refer to 
“commercial SSH” are based on SSH Secure Shell 3.2. Starting
with version 4.0, this product is known as SSH Tectia. 


These tools and features are available for all 
modern versions of UNIX, and are installed by default 
on most of them. All of them can be used to help 
enhance the principle of least privilege, which we will 
discuss at length here. 


The Situation 


Consider the following system administration 
environment and compare it with your own experience: 


• SSH has universally replaced rsh, but null-password
  keys have been deployed to provide unrestricted
  root access to automation tools; 
• Sudo has replaced the root password for most 
  administration functions, but admins generally 
  only use it to obtain root shells and almost 
  never employ it in automation tools; 
• Custom setuid scripts almost exclusively run as 
  root and setgid is seldom used; 
• Automation tools that require any root access, no 
  matter how little, run as unrestricted root through 
  root cron, root ssh and similar mechanisms; 
• Automation tools receive little security review, 
  even when granted wide-ranging privileges. 


Such environments have been the norm in the 
author’s experience. If you have a similar environ- 
ment, this paper will introduce the ideas behind least 
privilege and how these tools can be used to enhance 
least privilege in your environment. 


The Risks 


Some of the risks in the environment described 
above include: 

• Null-password root SSH keys.² If an attacker 
  can get to that key, she will have complete control
  over all machines that accept it. Even if your 
  application is secure, any mechanism that an 
  attacker can use to get to that file is fair game. 
• Sudo passwords. Every account that has unrestricted
  root sudo access is another root-equivalent
  password for an attacker to guess or steal. 
• Sudo hijacking. In sudo’s default configuration, 
  an attacker who can run commands as a sudo-enabled
  user can hijack that user’s sudo privileges
  even without access to the user’s password. 
• Sudo escalation. It can be extremely challenging
  to limit sudo access to a few commands. 
  Without great care, limited sudo can be trivially 
  translated into full sudo access. While you may 
  trust the user you granted access to, do you also 
  trust the attacker who has stolen his identity? 
• Script exploitation. Scripts that run as privileged 
  users are obvious targets for attackers. Errors in 
  the scripts are subject to exploitation. Setuid 
  scripts are particularly susceptible because they 
  are often written in scripting languages like Perl 
  or Bourne Shell and can be read by an attacker 
  searching for vulnerabilities. 

We’ll discuss how to mitigate all of these. 


The Causes 


It’s tempting to simply blame “coder laziness” 
for this situation, but this isn’t the case. There are several
factors that we will need to address: 

• Trust in “instant security.” Neither SSH nor 
  sudo can be simply “dropped in place” and 
  deliver an ideal security environment. While 
  SSH is far better out of the box than rsh, it has 
  its own security issues that have to be considered,
  and converting automation tools to use it 
  can be difficult without tearing down some of 
  its benefit. Similarly, sudo introduces several 
  security concerns, some of which are worse 
  than what it replaces (such as a greater number 
  of root-equivalent username/password combinations).
  This is not to discourage the use of 
  these tools, but they do not magically instill 
  security on their own. 

• Lack of best practices guides. There are limited 
  resources available explaining the best way to 
  set up SSH and sudo. Out of the box, sudo does 
  not even have all of its security features turned 
  on and is subject to hijacking (as we’ll discuss 
  below). SSH command keys are mentioned in 
  the man pages, but there are few resources 
  really explaining their use or the use of other 
  SSH key restrictions. 

• Added complexity. Many of the techniques in 
  this paper increase the complexity of developing
  and deploying automation scripts. Automation
  is hard enough to just get working, let 
  alone get working securely. If developers are 
  only rewarded for functionality, then there is 
  little incentive to take on the added support 
  headaches of a more secure solution. 


²Throughout this paper, the term “SSH key” will be used 
to refer to both RSA and DSA keys. 


This paper will address the first two causes. 
Addressing the third is often a cultural and infrastruc- 
ture challenge that can only be solved on a case-by- 
case basis. 


The Goal: Least Privilege 


Now that we’ve discussed what may be wrong 
with our environment, what do we want our environ- 
ment to look like? In this paper we will mostly focus 
on least privilege, which is one piece of the bigger 
goal of layered security. 


Layered security means that safeguards overlap 
such that if one fails, an attacker will still not have 
damaging access. Least privilege helps ensure that if a 
particular user’s account is compromised, for whatever 
reason, the damage the attacker can do with it is limited 
as much as possible. This is why “don’t you trust me?” 
should never be the argument for excessive privileges. 
Wherever possible, trust should be compartmentalized. 


UNIX-like systems provide numerous ways to 
restrict privileged access. In this paper we will discuss 
the following techniques: 

• Restricting SSH connections in what they can 
  execute and where they can originate; 
• Limiting privileged access through sudo by 
  coupling it with non-root setuid; 
• Replacing root-setuid with non-root setuid and 
  setgid; 
• Reducing the number of privileged processes 
  with sticky bits and setgid directories. 


Whenever a process or user needs elevated privi- 
leges, it should be second nature to ask precisely what 
privileges the process or user needs, and how to best 
limit the process or user to exactly those privileges. 


When discussing the principle of least privilege, 
one might ask “why would we have hired these people 
if we didn’t trust them?” Least privilege has little to 
do with the trust we have for our employees. Instead, 
it deals much more with the number of avenues an 
attacker has for exploiting the system. Of course an 
administrator should have every access she needs, but 
conversely she should have no access that she has no 
need for. How strictly “need” is defined is a serious 
trade-off to consider, but just requiring that an admin- 
istrator explicitly request specific access, even if it is 
always granted, can go a long way towards controlling 
the number of avenues an attacker can use. If an 
attacker is successful, being able to enumerate the 
accounts with access is also a major benefit to investigators
in determining possible further compromises. 


When granting privileges to automation tools, 
one might assume that the security of a particular tool 
isn’t very important if the data it deals with is non-sen- 
sitive. It is critical to always consider how a tool could 
be exploited to attack other parts of the network, not 
just the parts it’s intended to control. 


Moving towards least privilege, especially for au- 
tomation tools, has other benefits. Establishing least 
privilege requires developers to understand the privi- 
leges actually used by their tools, which in turn forces 
them to understand what their tools are doing. Under- 
standing software is a key step towards maintaining it. 
Moreover, simply enumerating the privileges that a tool 
requires can help a developer see how to reduce the 
number of privileges required. Does the tool really need 
the ability to “run an arbitrary command on any host in 
the system” or does it just need the ability to “get a 
directory listing for a specific directory on three hosts”? 


Least privilege is a philosophy, not a technology. 
By consistently employing it, an organization can bet- 
ter understand and control the security of the environ- 
ment while still maintaining a strong culture of trust 
for the administrators. 


Hardening the Environment 


This paper focuses on automation techniques, but 
some basic environment hardening will set the stage 
for a secure automation environment. 


Understanding the Environment 


In a complex environment with many users and 
administrators, it is easy for trust relationships to grow 
throughout the system with little documentation or 
understanding. To combat this, it is helpful to create a 
directed trust graph of your network, indicating partic- 
ularly how root can move through the system using 
SSH, rsh and other mechanisms (such as custom 
administration daemons and web scripts that are some- 
times developed in large environments). There are few 
tools to automate this today, but even manually devel- 
oping such a graph with tools like Microsoft’s Visio or 
AT&T’s Graphviz can provide significant insight into 
your environment. 
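

As a rough sketch of how such a graph might be produced with Graphviz, the function below turns a hand-maintained list of trust relationships into DOT input. The input format (one "source target mechanism" triple per line), the function name trust_to_dot and the file names in the usage note are assumptions made for illustration, not an existing tool.

```shell
#!/bin/sh
# Sketch: convert a list of trust relationships into a Graphviz digraph.
# Input lines (hypothetical format): <source> <target> <mechanism>
# Lines starting with '#' and lines with fewer than three fields are skipped.
trust_to_dot() {
    awk 'BEGIN { print "digraph trust {" }
         NF >= 3 && $1 !~ /^#/ {
             # One directed edge per relationship, labeled with the mechanism.
             printf "  \"%s\" -> \"%s\" [label=\"%s\"];\n", $1, $2, $3
         }
         END { print "}" }' "$@"
}
```

With Graphviz installed, something like `trust_to_dot trust.txt | dot -Tps -o trust.ps` would render the result; hosts that many edges point into are the places where broad trust has accumulated.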


Understanding what users and hosts are trusted 
with wide-ranging root access provides a roadmap for 
improving enforcement of least privilege. There will 
always be a few places in any large system that require 
broad trust; identifying them shows where hardening 
matters most. 


Similarly, administrators should maintain a cata- 
log of known setuid and setgid programs and audit 
systems regularly for the creation of new ones. 
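

One way such an audit might be scripted is sketched below: list every setuid or setgid file under a directory and report those missing from a sorted catalog. The function names and the catalog path in the usage comment are hypothetical, and the catalog is assumed to be a sorted list of absolute paths, one per line.

```shell
#!/bin/sh
# Sketch: audit a directory tree for setuid/setgid files.

# Print every regular file under directory $1 that has the setuid or
# setgid bit set, one path per line, sorted. -xdev stays on one filesystem.
list_setid() {
    find "$1" -xdev -type f \( -perm -4000 -o -perm -2000 \) -print 2>/dev/null | sort
}

# Print files found under directory $2 that are absent from the sorted
# catalog file $1 (comm -13 keeps only lines unique to the second input).
new_setid() {
    list_setid "$2" | comm -13 "$1" -
}

# Example (run as root so every directory can be read):
#   new_setid /var/adm/setid.catalog /
```

Any path this prints is a setuid or setgid program that nobody has reviewed yet.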


Hardening and Managing SSH 


Authorized Keys 


By default, SSH relies on files in the user’s home 
directory for certain authentication options. Chief among 
these is the authorized_keys³ file. This file defines 
what keys will be accepted without a password and 
under what conditions, and will be the subject of several
SSH techniques in this paper. Anyone who can 
write to this file for a particular user can log in as that 
user. This means that user home directories, particularly
the ~/.ssh directory, are highly sensitive. Unfortunately,
if home directories are NFS mounted, there are 
a number of ways that attackers may be able to write 
to arbitrary user directories, and thereby update authorized_keys
with keys the attacker controls.⁴ The solution
is to move authorized_keys out of the users’ NFS-mounted
home directories and onto local storage, generally
under /var. For example, the following setting in 
sshd_config will read authorized_keys from 
/var/ssh/user/authorized_keys: 


AuthorizedKeysFile /var/ssh/%u/authorized_keys 


Of course you will need to create directories for 
the users under /var/ssh which only they can write to. 
Users will also need to create separate authorized_keys
files for every server. This differs from 
many users’ behavior of setting up a single authorized_keys
file for all servers (since it is mounted by 
NFS). While this has some overhead, it once again 
encourages the principle of least privilege in that only 
machines for which the user explicitly requests pass- 
wordless connections will accept them. 
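

Creating the per-user directories might look like the following sketch, which assumes the /var/ssh/user/authorized_keys layout described above. The function name make_key_dir and the user names in the usage comment are hypothetical; the mode 755 is our assumption of a reasonable setting that leaves the directory writable only by its owner.

```shell
#!/bin/sh
# Sketch: create a local key directory that only its owner can write to.
# $1 is the base directory (e.g. /var/ssh), $2 is the user name.
make_key_dir() {
    base=$1
    user=$2
    mkdir -p "$base/$user" &&
    chown "$user" "$base/$user" &&
    chmod 755 "$base/$user"    # owner may write; everyone else read-only
}

# Example (as root on each server):
#   for u in alice bob; do make_key_dir /var/ssh "$u"; done
```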


Root Keys 


Unrestricted SSH keys accepted by root are 
extremely powerful and should be avoided. Adminis- 
trative users should generally use their own credentials 
to log into a server and then use sudo to gain root 
access there. Automation scripts that require remote 
root should use command keys, which will be discussed
further in “Command Keys.” To enforce this, 
the PermitRootLogin option in sshd_config should be 
set to forced-commands-only. 
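

A simple check that a host's configuration file carries this setting could be sketched as follows. The function name is hypothetical, and the check is deliberately naive: it looks only at the first PermitRootLogin line in the given file (sshd uses the first value it obtains for a keyword) and ignores any conditional blocks.

```shell
#!/bin/sh
# Sketch: succeed only if the first active PermitRootLogin line in the
# sshd_config file named by $1 is set to forced-commands-only.
# Commented-out lines do not match because their first field begins with '#'.
root_login_restricted() {
    awk 'tolower($1) == "permitrootlogin" { v = tolower($2); exit }
         END { exit !(v == "forced-commands-only") }' "$1"
}

# Example:
#   root_login_restricted /etc/ssh/sshd_config || echo "root login not restricted"
```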


Known Hosts 


SSH provides powerful features to prevent server 
spoofing and man-in-the-middle attacks. Most notable 
is the use of public keys to strongly identify servers. 
This technique is not fool-proof however. SSH keys 
cannot be signed as X.509 certificates are, so unless 
you’ve received the server key from a trusted source, 
you have no way to know that the key is legitimate. 
There are three primary ways to get server keys: 
LDAP, centrally managed ssh_known_hosts and user-managed
known_hosts. We will also briefly discuss 
using X.509 server certificates with commercial SSH. 


Commercial SSH allows server keys to be centrally
stored in LDAP, which is generally easiest to 
manage. OpenSSH and most free Microsoft Windows 
clients (such as PuTTY) cannot retrieve server keys 
from a central LDAP server, but for installations using 
commercial SSH, managing the server keys centrally 
is highly recommended. Whenever a server key is 
generated, it should be added to LDAP. For environments
with multiple networks supported by different 
organizations, or for dealing with servers outside of 
your environment, commercial SSH supports multiple 
LDAP servers. 


³This paper uses the OpenSSH filenames and formats for 
configuration files. Commercial SSH uses slightly different 
file names and in some cases formats. 

⁴Computer Incident Advisory Capability, CIAC Notes 
95-07, “NFS export to unprivileged programs.” See 
http://ciac.llnl.gov/ciac/notes/Notes07.shtml . 


All UNIX SSH clients support a file called 
ssh_known_hosts, generally stored in /etc or /etc/ssh, 
which contains the official list of server keys. This file 
must somehow be distributed to all clients. This file 
could also be NFS mounted, but this reintroduces the 
NFS security problems discussed above. Even so, NFS 
mounting this file may be better than not managing 
ssh_known_hosts at all. Centrally managing ssh_known_hosts
is generally only effective within an organization.
Since there can be only one file and it needs to be 
read from disk, there is no good way to include other 
organizations’ host keys. Furthermore, since the users 
must trust the provider of the central ssh_known_hosts 
file to provide legitimate keys, this file can only be 
accepted as far as trust extends within the environment 
(generally as far as the central support organization). 


If a server is not listed in the central ssh_known_hosts,
SSH will by default prompt the user to 
add the key to the user’s known_hosts, stored in ~/.ssh. 
This is the least secure option, since the user has no 
good way to determine the authenticity of the key. 
Once a key has been added to the user’s known_hosts, 
however, SSH will warn the user if a server ever 
responds with a different key. This could indicate that 
a machine is being spoofed. Unfortunately it could 
also mean that the machine has been legitimately 
replaced. In environments where this is common, 
users have no good way to determine whether the 
warning is legitimate. To avoid these problems, it is 
highly recommended that ssh_known_hosts be centrally
managed rather than rely on users’ known_hosts. 


Failing to centrally manage ssh_known_hosts 
creates special problems for automation scripts. Since 
scripts have no way to respond to the new key, they 
will fail if the key changes. This is a good thing in that 
it protects scripts from machine-spoofing, but it does 
create administrative headaches when scripts start failing
due to a key change. Once again, the best solution 
to this problem is central management of ssh_known_hosts.
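

As an illustration of central management, the sketch below assembles an ssh_known_hosts file from a directory of host public keys. It assumes each server's public host key was copied to a trusted location when the server was built and saved as hostname.pub; that directory layout and the function name build_known_hosts are our assumptions, not part of OpenSSH.

```shell
#!/bin/sh
# Sketch: build ssh_known_hosts lines from a directory of <hostname>.pub files.
# A public key file holds: key-type base64-key [comment]
# Each output line holds:  hostname key-type base64-key
build_known_hosts() {
    for f in "$1"/*.pub; do
        [ -f "$f" ] || continue
        host=$(basename "$f" .pub)
        key=
        # Tolerate a missing trailing newline: read fails but still fills key.
        read -r type key comment < "$f" || [ -n "$key" ] || continue
        printf '%s %s %s\n' "$host" "$type" "$key"
    done
}

# Example:
#   build_known_hosts /var/adm/hostkeys > ssh_known_hosts
```

The resulting file still has to be distributed to clients by whatever trusted mechanism the site already uses.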


Commercial SSH improves this situation by allowing
servers to use signed X.509 certificates rather than 
SSH keys. Since these certificates are signed by a Certificate 
Authority, clients can rely on their authenticity without 
having all the keys in advance, greatly simplifying the 
administrative overhead of key management. Since 
most free clients (including OpenSSH) do not support 
these certificates, they are most useful in a completely 
commercial SSH environment, but in such an environment
they are highly recommended as an alternative to 
ssh_known_hosts or LDAP. 


⁵This works for managed UNIX clients, but has no good 
parallel for Windows clients. 


Hardening Sudo 


Unrestricted sudo effectively creates additional 
root-equivalent passwords for an attacker to guess or 
steal. Each administrator’s password must now be pro- 
tected with the same care as the root password. There 
are two approaches to mitigating this risk. Sudo- 
enabled administrative accounts can be separated from 
the administrator’s regular account. Doing so will 
greatly reduce the opportunities for an attacker to steal 
the sensitive password. Alternately, sudo can be com- 
piled to work with several one-time password systems 
such as OPIE, S/Key and SecurID. Deploying such 
systems is non-trivial, can be expensive and is beyond 
the scope of this paper. 


Sudo has a significant security flaw in its default 
configuration that permits hijacking in which an attacker 
can make use of the victim’s sudo privileges without the 
victim’s password. Sudo uses tickets, files that are cre- 
ated to only require a user to enter her password at cer- 
tain intervals. By default these tickets are created on a 
per-user basis, so if the user is logged on multiple TTYs 
on the same host, her ticket is valid for all of them. 
While modestly convenient, this is a significant security 
hole. If an attacker is able to run an arbitrary process as 
the victim user, then the attacker can piggyback on the 
victim’s sudo privileges even without the victim’s pass- 
word. When the victim uses sudo, the attacker then has 
a five minute (by default) window to use sudo without a 
password. Coupled with the NFS authorized keys 
attack discussed above, this is a very significant attack 
against administrative users.® 


There is a complete but inconvenient solution to 
this, and an incomplete but fairly easy solution. The 
complete solution is to turn off password caching 
entirely, either by compiling with --with-timeout=0 or 
by setting timestamp_timeout to 0 in the central sudo configuration
file, /etc/sudoers. Doing so completely 
closes this particular attack, but strongly encourages 
system administrators to use a root shell to avoid retyp- 
ing their passwords repeatedly. Since root shells cannot 
be easily logged, this is a significant auditing trade-off. 


The less drastic solution is to compile sudo with 
--with-tty-tickets or set tty_tickets to “on” in sudoers. 
This will create a separate ticket for each user/TTY 
combination, stopping an attacker from piggybacking 
on the ticket in many cases. This is not a complete 
solution, however. The attacker can still attack the victim’s
login scripts to have the attack happen within the 
victim’s TTY. The attacker can also attempt to log in to 
the server immediately after the victim logs off. On 
many operating systems (including Solaris, *BSD, and 
Linux) the attacker will often be allocated the same 
TTY as the victim had, and the ticket may still be 
valid. This latter attack can be mitigated with logout 
scripts that run “sudo -k” to destroy tickets, but it can 
be challenging to ensure that all administrators run 
this logout script. So turning on TTY tickets is better, 
but to completely close this hole, password caching 
has to be turned off. 


⁶ssh-agent [OSSH] can be similarly attacked in order to 
make use of another user’s SSH key. This seldom impacts 
automation tools because they are less likely to use ssh-agent,
but it is worth keeping in mind for administrators. 
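

For illustration, the two approaches might appear in sudoers roughly as follows. This is a sketch only: tty_tickets and timestamp_timeout are the option names documented in the sudoers manual, but which line to enable, and whether your sudo version already defaults to per-TTY tickets, are things to confirm locally.

```text
# /etc/sudoers fragment (always edit with visudo)

# Partial fix: keep a separate ticket for each user/TTY combination.
Defaults tty_tickets

# Complete fix: never cache credentials; prompt on every sudo invocation.
# Defaults timestamp_timeout=0
```

If only the partial fix is used, placing `sudo -k` in administrators' logout scripts narrows the window in which a reused TTY still carries a valid ticket.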


It is very difficult to manage sudo such that users 
cannot escalate their privileges. This will be discussed 
further in “Controlling Sudo.” 


Limiting Privilege with SSH Command Keys 


The most significant way to limit the power of an 
SSH key is to apply a command restriction. When a 
user connects using an SSH key with a command 
restriction, or a command key, a pre-defined command 
runs rather than providing the user with a shell. 
Applying this to root access, along with setting PermitRootLogin
to forced-commands-only,⁷ provides a 
powerful way to control automation tools. If the auto- 
mation tool runs as a non-privileged user and only has 
access to a particular root command key, then that tool 
can get the root access it needs while reducing the 
ability to subvert it into performing arbitrary actions 
as root. 


For most of the examples in this paper, we will 
consider the same simple task. We will change 
Apache’s ErrorLog entry on a remote host to include 
the current month and restart Apache. This is a some- 
what contrived example, since this would generally be 
done in simpler ways, but it demonstrates some of the 
main issues. The script we wish to run, update_errorlog,
is shown in Figure 1. 


To create a command key that runs update_error- 
log, first create a keypair on the source machine: 


$ ssh-keygen -t dsa -f errorlog_key 


You now have a public key called errorlog_key.pub 
and a private key called errorlog_key. Prepend the public key 
with your command restriction and append it to 
~root/.ssh/authorized_keys on the target machine. The 
format is as follows: 


command="/usr/bin/update_errorlog" 
[public_key] 
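
The entry format above can be assembled mechanically. The following is a minimal sketch, assuming a POSIX shell; the helper name command_key_entry is our own illustration and is not part of OpenSSH. It prints a single authorized_keys line that combines the command restriction with the contents of a public key file.

```shell
# Sketch: build one authorized_keys entry that forces a command.
# command_key_entry is a hypothetical helper, not an OpenSSH tool.
command_key_entry() {
    forced_cmd=$1      # absolute path of the command to force
    pubkey_file=$2     # e.g. errorlog_key.pub from ssh-keygen
    # An entry is a single line: options, a space, then the public key.
    printf 'command="%s" %s\n' "$forced_cmd" "$(cat "$pubkey_file")"
}

# Usage (run where the .pub file lives, append the output on the target):
#   command_key_entry /usr/bin/update_errorlog errorlog_key.pub
```

Because the forced command is placed inside double quotes, a command that itself contains double quotes would need escaping, which this sketch does not attempt.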


⁷ This only allows root to accept command keys, so there 
cannot be root-level SSH login keys. 


#!/bin/sh 
PATH=/bin:/usr/bin:/usr/sbin 
date=`date +%F` 

## Rewrite httpd.conf 
perl -pi -e "s!^ErrorLog(.*)!ErrorLog /var/log/error_log.$date!" \ 
    /etc/httpd/conf/httpd.conf 
## Restart Apache 
apachectl graceful 

Figure 1: update_errorlog. 


Now log in using the new key and the script will run: 


$ ssh -i errorlog_key \ 
    root@target.example.com 


Non-root Keys 


Many remote functions do not require root 
access at all. If you create special users for these 
functions and give each a distinct SSH command key, 
an attacker who steals a key gains only extremely 
limited access. 


This can be combined with sudo to provide 
functionality very similar to root command keys. By 
granting the special user specific sudo privileges, it is 
possible to create scripts that use root precisely when they 
need it and no more. As an example, we’ll run 
update_errorlog (Figure 1) using a non-root SSH key. 


On the target machine, create a new group 
apacheconf that can write to httpd.conf. We don’t want 
to use the apache group itself, because httpd should 
not be allowed to write to its own configuration files 
(otherwise a security flaw in Apache could be used to 
reconfigure Apache). Use a low-numbered GID to 
help distinguish it from user accounts. Put httpd.conf 
into the apacheconf group so that our new group can 
manage it without root access. 


Now create a new user, errorlog, to run 
update_errorlog. Put it into the apacheconf group and 
give it a low-numbered UID to help distinguish it from 
user accounts. 

Our update_errorlog script (Figure 1) now needs 
a small modification, adding “sudo”: 

... 
sudo /usr/sbin/apachectl graceful 

Edit sudoers to grant the errorlog account permission 
to run “/usr/sbin/apachectl graceful”. 

Finally, set up a command key as we did in the 
“Command Keys” section, but instead of making it a 
root SSH key, make it an SSH key for errorlog. 

We can now update httpd.conf and restart Apache 
from source.example.com: 


source$ ssh -i errorlog_key \ 
    errorlog@target.example.com 


Originator Restrictions 


Keys (both regular keys and command keys) can 
be further restricted to specific originating hosts using 
the “from” option in authorized_keys. For example: 


from="*.example.com,*.example.net" 
ssh-rsa AAA...oeTp0O= rnapier@adminhost 


This key will only permit connections from 
machines within the example.com or example.net 
domains. The “from” option accepts a comma-separated 
list of canonical names or IP addresses, including 
wildcards. Note that rnapier@adminhost is only a 
comment to give a hint where this key was created and 
has no impact. 


Originator restrictions based on DNS names are 
reliant on trustworthy name services, but this still 
greatly increases the complexity of attack. The 
attacker must already have stolen and possibly cracked 
the private key, and then will still have to poison or 
compromise DNS in order to make use of that key.⁸ 


Other Restrictions 


Most extended SSH features can be turned off on 
a per-key basis. This includes X-forwarding, 
port-forwarding, PTY-generation,⁹ and similar features. It is 
generally a good idea to turn off any features you 
don’t need. For example: 

no-port-forwarding,no-X11-forwarding, 
no-agent-forwarding,no-pty ssh-rsa 
AAA...oeTp0O= rnapier@adminhost 
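
The originator, command, and feature restrictions discussed so far can all be stacked on one key. The following is a rough sketch, assuming a POSIX shell; restricted_key_entry is a hypothetical helper of our own, and the option names are the authorized_keys options already shown in the text. It prints one fully restricted entry as a single line, which is how the entry must appear in the file.

```shell
# Sketch: emit one authorized_keys line carrying every restriction
# discussed above. restricted_key_entry is a hypothetical helper.
restricted_key_entry() {
    from_list=$1    # comma-separated host patterns, e.g. '*.example.com'
    forced_cmd=$2   # absolute path of the forced command
    pubkey=$3       # the public key text, e.g. 'ssh-rsa AAA... comment'
    # Options are comma-separated with no spaces between them;
    # a single space separates the options from the key itself.
    printf 'from="%s",command="%s",no-port-forwarding,no-X11-forwarding,no-agent-forwarding,no-pty %s\n' \
        "$from_list" "$forced_cmd" "$pubkey"
}

# Usage:
#   restricted_key_entry '*.example.com' /usr/bin/update_errorlog \
#       "$(cat errorlog_key.pub)"
```

If the forced command needs a PTY (see the footnote on PTY generation), no-pty would have to be dropped from the option list.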


Controlling Sudo 
The Pitfalls of Limited Sudo 


Using sudo to give limited access to root is a 
very tricky proposition, since the most obvious sudo 
configurations can be easily escalated to unlimited 
root access. As we discussed in the Introduction, even 
if you trust the user not to do this, you also have to 
trust the attacker who gains access to the user’s pass- 
word (or subverts sudo in some other way). 


Some exploitable situations include: 

• Permission to run commands in a user-writable 
directory. 

• Access to chmod (even more easily exploitable 
with access to chown or cp) 

• Access to any command with shell-outs (vi, 
emacs, ed, edit, more, less, find), though sudo 
version 1.6.8 promises to help here 

• Access to any command that can write 
(especially append) to an arbitrary file (vi, emacs, ed, 
edit, tee, less) 

• Access to root’s crontab or atjobs (crontab, 
batch, at) 

• Any command that honors PAGER, EDITOR, 
or VISUAL (man, less, more) 

• In some cases, any command that can read an 
arbitrary file (cat, less, more, tail). These can be 
used to get /etc/shadow for offline cracking, or 
can be used to read other protected files 

• Access to sudo itself as root. This allows 
attacks like “sudo sudo /bin/sh”. There are 
options to prevent this (--disable-root-sudo at 
compile time, or unsetting root_sudo in 
sudoers), but these are fairly weak protections meant 
to stop administrators from circumventing a 
!SHELLS entry. If you need these options, then 
you’re probably allowing so many other 
commands (like those above) that an attacker can 
easily gain a root shell anyway. 


⁸ SSH protects clients from connecting to the wrong server 
through host keys, but it doesn’t protect servers from hostile 
clients. If a user shows up with the correct user key, no 
client host key checking is done. Even with the “from” 
restriction, only the DNS name is checked, not a host key, 
since there often will be no host key for a client. 

⁹ Many UNIX commands, most notably ls, have different 
newline handling if there isn’t a PTY. If your tool can’t 
handle this, you may need to allow PTY creation. 
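
The first pitfall in the list above, commands that live in user-writable directories, lends itself to a mechanical check. The sketch below is only an illustration under stated assumptions: writable_ancestors is a hypothetical helper name, it expects an absolute command path, and it treats any group-writable or world-writable ancestor directory as suspect without looking at ownership. A real audit would also consider who owns each directory and which groups are trusted.

```shell
# Sketch: list every ancestor directory of a command path that is
# group-writable or world-writable (hypothetical helper).
writable_ancestors() {
    dir=$(dirname "$1")
    while :; do
        # -prune stops find from descending, so only $dir itself is
        # tested; -perm -020 / -002 match the group/other write bits.
        find "$dir" -prune \( -perm -020 -o -perm -002 \) -print
        case $dir in
            /|.) break ;;
        esac
        dir=$(dirname "$dir")
    done
}

# Usage: review any output before granting sudo rights to the command.
#   writable_ancestors /usr/local/bin/update_errorlog
```

An empty result does not prove a sudo rule is safe; it only rules out this one class of problem.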


With the release of sudo 1.6.8, two new features 
have been added that make limited sudo somewhat 
easier to implement. A common sudo need is to allow 
the editing of a protected file. This has historically 
been very difficult to provide in a controlled way with- 
out writing wrapper scripts. A user who is allowed to 
run an editor as root can almost certainly modify arbi- 
trary files and trivially gain shell access. The new 
“-e” option to sudo, also accessible by running 
sudoedit, fixes this. It makes a temporary copy of the 
target file that is owned by the user. The user is then 
provided the editor of their choice, but since they are 
still running under their own userid there is no security 
issue. When the editor exits, sudo will replace the 
original file with the temporary copy. In the past, some 
administrators have written scripts to do just this, but 
moving this functionality into sudo itself should make 
things much easier. To allow a user to use sudoedit, 
treat it like any other command, but don’t give a full 
path to it. The alias “sudoedit” represents either 
sudoedit or “sudo -e”. By appending a filename, you 
can restrict the user to editing particular files. For 
example: 


rnapier host=(root) sudoedit /etc/httpd.conf 


Another major improvement in 1.6.8 is the 
addition of a NOEXEC option.¹⁰ On operating systems 
that support it,¹¹ the NOEXEC option will prevent a 
command run under sudo from calling exec() itself. 
This will prevent the shell-outs that provide trivial 
root shells from so many commands, from editors to 
pagers. Given the newness of this technique, only time 
will tell how effective it is in practice. 


The solution to providing limited sudo is single- 
purpose wrappers, small scripts written to do exactly 
what is required. By providing sudo access to just these 
wrappers, least privilege can be much better achieved. 
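
To make the idea of a single-purpose wrapper concrete, here is a minimal sketch in Bourne shell. It is an illustration only: the function name show_log and the fixed log path are assumptions chosen to match the mysqld.log example that follows, not code from any tool. The wrapper does one thing, accepts no arguments, and does not trust the caller's PATH.

```shell
# Hypothetical single-purpose wrapper: prints one fixed log file and
# nothing else. Intended to be the only command a sudoers rule allows.
PATH=/bin:/usr/bin            # never trust the caller's PATH
export PATH
LOG_FILE=/var/log/mysqld.log  # fixed: the caller cannot choose the file

show_log() {
    # Reject any arguments so the caller cannot smuggle in options
    # or alternative file names.
    if [ $# -ne 0 ]; then
        echo "usage: show_log (takes no arguments)" >&2
        return 2
    fi
    cat "$LOG_FILE"
}

# In a real wrapper script the last line would be:  show_log "$@"
```

The less a wrapper accepts from its caller, the less there is to audit when it is granted root through sudo.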


For example, let’s consider a script mysqllog, 
which prompts the user for her password, validates it 
against /etc/shadow, and if successful, displays 
/var/log/mysqld.log. This log file is owned by the 
mysql user and group-owned by the mysql group. It is 
only readable by user and group. 


¹⁰ [SUDO], sudoers man page, “NOEXEC and EXEC.” 

¹¹ This includes at least SunOS, Solaris, *BSD, Linux, 
IRIX, Tru64 UNIX, MacOS X, and HP-UX 11.x. It does not 
work on AIX and UnixWare. [SUDO] 


#!/usr/bin/perl -w 
use strict; 

sub error { print @_; exit 2; } 

if( $> != 0 ) { error "Must run as root\n"; } 

my $good_pwd = (getpwuid($<))[1] or error($!); 
chomp( my $test_pwd = <> ); 

if (crypt($test_pwd, $good_pwd) ne $good_pwd) { 
    exit 1; 
} else { 
    exit 0; 
} 

Figure 2: checkpass. 


#!/usr/bin/perl -w 
use strict; 

my $user = getpwuid($<); 
print "Password: "; 
system( '/bin/stty', '-echo' );   # Don't echo 
my $password = <>; 
system( '/bin/stty', 'echo' );    # Do echo 
print "\n"; 

open( CHECKPASS, '|-', '/usr/bin/sudo', 
      '/home/rnapier/checkpass' ) 
    or die $!; 
print CHECKPASS $password; 
close CHECKPASS; 
if ($? == 0) { 
    system( '/usr/bin/sudo /bin/cat' . 
            ' /var/log/mysqld.log' ); 
} 
else { 
    print "Bad password.\n"; 
} 

Figure 3: mysqllog. 


Since mysqld.log is group-owned by mysql, the 
script will need to have access to that group. It also 
seems to need root privileges in order to access 
/etc/shadow. The obvious solution is to simply make it 
a setuid root script, but this would give it far more 
access than it needs. In fact, this script doesn’t actually 
need to be able to read /etc/shadow; it only needs to be 
able to verify that a given username/password combi- 
nation is valid. Carefully stating your privilege require- 
ments is the first step towards achieving least-privilege. 


As before, we’ll create a user, mysqllog, to run 
this script and edit sudoers to give it permission to run 
“checkpass” and “/bin/cat /var/log/mysqld.log”. 


The checkpass script is listed in Figure 2. It reads 
a password for the current user from STDIN. It then 
exits with a 0 to indicate a good password, a 1 to 
indicate a bad password, or a 2 to indicate an error. We 
pass the password in on STDIN because command 
line parameters can be seen in the process table by all 
users. Note that this script can only validate the cur- 
rent user, not an arbitrary user. Once again we keep to 
least privilege. 


The code to perform our task is shown in Figure 
3. We make it setuid to the mysqllog user we created 
earlier. Now arbitrary users can run this script, enter 
their password, and get the contents of mysqld.log. 
Even if an attacker can find a bug in the script, the 
privileges that can be exploited are very limited. 


Setuid/setgid vs. Sudo 


Setuid scripts can check the calling user to 
determine whether they have rights to run this script 
(generally by checking group membership). This is 
convenient for the authorized user, because she is saved the 
trouble of typing “sudo.” An example of such a check 
in Perl is given in Figure 4. 


#!/usr/bin/perl -w 

my $group_wheel = getgrnam( 'wheel' ) 
    or die; 
if( $( !~ /\b$group_wheel\b/ ) { 
    die( "Must be part of wheel" . 
         " group to run this script.\n" ); 
} 

Figure 4: Checking group membership in Perl. 


Alternatively, privilege-requiring scripts can be 
executed using sudo. This has the advantage of 
providing centralized accounting of all privileged users 
and reducing the complexity of the scripts. 


Non-root Sudo 


One should always consider when using sudo 
whether the user needs root access or whether access 
to a non-privileged user like apache or jabber might be 
sufficient. Sometimes changing the ownership or 
group of configuration or log files is enough to allow 
less-privileged accounts to manage them. Be careful 
with this, however. Many services like Apache should 
not be run under a UID that can write to their configu- 
ration files. Doing so could allow a minor compromise 
to be escalated into a larger compromise by allowing 
the server to be reconfigured by an attacker. That said, 
there is no reason that Apache’s configuration files 
can’t be owned by an apacheconf user and administra- 
tors given appropriate sudoedit privileges to that user. 
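
The ownership change described above can be as small as two commands. The following sketch assumes a POSIX shell; delegate_to_group is a hypothetical helper name, and the mode chosen (owner and group may write, others may only read) is an illustrative assumption that a site would adjust to its own policy. The caller must own the file, or be root, and must be allowed to assign the target group.

```shell
# Sketch: hand day-to-day management of one file to a dedicated
# group (hypothetical helper name).
delegate_to_group() {
    file=$1    # e.g. /etc/httpd/conf/httpd.conf
    group=$2   # e.g. apacheconf, a group the server itself is NOT in
    chgrp "$group" "$file" &&
        chmod u=rw,g=rw,o=r "$file"
}

# Usage (as root on the target machine):
#   delegate_to_group /etc/httpd/conf/httpd.conf apacheconf
```

As the text warns, the group given write access should never be one that the service itself runs under.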


Setuid/setgid with Sudo 


Setuid/setgid can be combined very effectively 
with sudo. For example, the script can be setgid to a 
special group. This group can then be given root sudo 
privileges to run specific single-purpose wrappers. 
The script can then use sudo to execute these wrappers 
to escalate to root privileges precisely when needed, 
and only for precisely what is needed. Furthermore, 
since the single-purpose wrappers are not themselves 
setuid, they can only be called indirectly, by already- 
privileged processes. This helps prevent an attacker 
from passing them unusual parameters, making them 
less susceptible to security coding flaws. 


As an example, we can achieve the same 
functionality as in update_errorlog (Figure 1) on our local 
machine using a setuid script (Figure 5). As in the 
“Non-root Keys” section, we’ll set up an errorlog user, 
including its sudo privileges, and make the script 
setuid to errorlog. When you run update_errorlog it 
will then update httpd.conf and restart Apache as long 
as you are in the wheel group. 


Similar techniques can be used with SSH 
command keys or CGI scripts that are run under specific 
user IDs. 


Setuid/Setgid Best Practices 
Non-root Setuid 


“Setuid” is sometimes confused with “run as 
root,” but this need not be the case. Setuid can be used 
to run a command as any particular user by chowning 
the file to that user. 


As with sudo, always consider whether your 
setuid script really needs root access or just access to a 
special user. For example, if a script needs access to a 
file containing a password, there’s no reason that file 
needs to be owned by root. It could be owned by any 
non-user account. 


Reducing the number of setuid-root scripts 
reduces the number of ways attackers can exploit cod- 
ing errors to obtain root. 

Setgid 

In some cases, setgid can be even more useful 
than non-root setuid. As we saw in the “Non-root 
keys” section, creating a group to manage configura- 
tion files such as for Apache can help isolate access to 
these files from root access. Setgid scripts can grant 
users access to these protected files, while still pre- 
serving the user’s own privileges (such as access to 
their home directory) without any special handling of 
effective UID. 


Setuid Script Obfuscation 


Setuid scripts in languages like Perl and Python 
present a special problem. They have to be readable 
by the user, giving an attacker an opportunity to study 
them looking for security flaws to exploit. The 
attacker may even be able to copy the script to another 
machine to test possible exploits offline. 


#!/usr/bin/perl -w 
use POSIX qw(strftime); 

my $group_wheel = getgrnam( 'wheel' ) or die; 
if( $( !~ /\b$group_wheel\b/ ) { 
    die( "Must be part of wheel group to run this script." ); 
} 

my $file = '/etc/httpd/conf/httpd.conf'; 
my $date = strftime("%F", localtime); 

if( -e "${file}.bak" ) { unlink( "${file}.bak" ) or die "$!" } 
rename( $file, "$file.bak" ) or die "$!"; 
open( INFILE, "$file.bak" ) or die "$!"; 
open( OUTFILE, ">$file" ) or die "$!"; 

while( <INFILE> ) { 
    s!^ErrorLog(.*)!ErrorLog /var/log/error_log.$date!; 
    print OUTFILE; 
} 
close INFILE or die $!; 
close OUTFILE or die $!; 

system( qw(/usr/bin/sudo /usr/sbin/apachectl restart) ) == 0 or die; 

Figure 5: update_errorlog in Perl. 


Compiled programs do not generally have to be 
readable by the user; they only require that the 
executable bit be set. So when writing setuid and 
setgid scripts in interpreted languages such as Perl or 
Python, there is some value to creating a small 
wrapper in C, as shown in Figure 6. 


#include <stdio.h> 
#include <stdlib.h>   /* exit */ 
#include <unistd.h>   /* execv */ 
#define CMD "/usr/local/protected/myscript" 

int main(int ac, char **av) 
{ 
    char error[80]; 

    execv(CMD, av);   /* only returns on failure */ 
    snprintf( error, sizeof( error ), 
              "Unable to run %s", CMD ); 
    perror( error ); 
    exit( 1 ); 
} 

Figure 6: myscript.c setuid wrapper. 


In the above example, /usr/local/protected should 
only be readable by the setuid user (often root), and 
myscript should be replaced with the script filename. 


Keep in mind that this is an obfuscation 
technique, not a security technique. If your script had no 
security flaws in it, then this technique wouldn’t be 
needed, and using this technique doesn’t prevent an 
attacker from exploiting your script’s security flaws. It 
just makes finding the flaws harder. 


While languages like Perl and Python have 
special handling to make setuid scripts “safe” (though 
readable by the user), it is not trivial (or even possible 
on some older platforms) to make Bourne and similar 
shell scripts setuid safely. Most operating systems 
don’t even allow this anymore.¹² These will absolutely 
require a wrapper script, though it would be wise to 
write setuid scripts in a language like Perl or Python in 
any case. Bourne shell and its cousins rely very 
heavily on the operating environment and external 
programs and so are much harder to adequately secure in 
a setuid context. 


¹² Shell scripts are subject to environment attacks including 
manipulation of PATH or IFS, and timing-based attacks 
based on moving links around between the time that the 
script is started and the script is read. Some of these have 
been addressed in modern versions of UNIX-based operating 
systems, but because of Bourne shell’s reliance on exter- 


Perl in particular provides taint checking, which 
helps programmers keep track of what data could have 
been influenced by the user. Setuid Perl scripts auto- 
matically turn on taint checking. If you use a C-wrap- 
per as above, Perl will no longer automatically turn on 
taint checking, so you should do so by passing “-T” 
in the perl invocation. 
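
Since taint checking is no longer automatic behind a C wrapper, it is easy to forget the flag. As a rough sketch, the helper below checks whether a script's first line invokes perl with a switch group containing “T”. The name shebang_has_taint_flag is hypothetical, and the pattern match is a simple heuristic of our own rather than a full parser for perl command lines.

```shell
# Sketch: succeed if the script's "#!" line runs perl with -T
# somewhere in its switches (hypothetical helper, heuristic match).
shebang_has_taint_flag() {
    first_line=$(head -n 1 "$1") || return 1
    case $first_line in
        '#!'*perl*' -'*T*) return 0 ;;   # e.g. "#!/usr/bin/perl -wT"
        *) return 1 ;;
    esac
}

# Usage: audit a protected script before installing its wrapper.
#   shebang_has_taint_flag /usr/local/protected/myscript ||
#       echo "warning: taint checking is not enabled" >&2
```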


Finally, whenever possible, make setuid and set- 
gid programs unreadable by anyone but the owner. 


Best Practices in Handling Privileged UID/GIDs 


Ideally all “special” UIDs and GIDs should 
have a consistent numbering convention. Generally, 
numbers under 100 (or 1000 for larger systems) 
should be reserved for these special IDs. 


Special UIDs should not generally permit direct 
login. They should not have a valid password or shell. 
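
A numbering convention makes these rules easy to audit. The sketch below is an illustration under stated assumptions: low_uids_with_shell is a hypothetical helper, it reads a passwd-format file (defaulting to /etc/passwd), and it treats any shell not ending in “nologin” or “false” as a login shell. It lists non-root accounts below the chosen UID threshold that still have such a shell.

```shell
# Sketch: list special (low-UID, non-root) accounts that still have
# what looks like a login shell. Hypothetical helper name.
low_uids_with_shell() {
    passwd_file=${1:-/etc/passwd}
    max_uid=${2:-100}   # the text suggests 100, or 1000 on larger systems
    awk -F: -v max="$max_uid" '
        # Field 3 is the UID, field 7 the shell.
        $3 > 0 && $3 < max && $7 !~ /(nologin|false)$/ {
            print $1 ":" $7
        }
    ' "$passwd_file"
}

# Usage:
#   low_uids_with_shell /etc/passwd 100
```

The output is a list to review rather than a list of violations, since some special accounts may need a shell for a specific, restricted purpose.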


It is often convenient for administrative staff to 
belong to special GIDs so that they can manage con- 
figuration or data files directly without needing further 
access (such as sudo). This is particularly useful for 
allowing non-root users to administer particular parts 
of the system. 


Odds and Ends 
Sticky Bits 


Setting the sticky bit on a directory allows users 
to write files that other users cannot remove, even 
though the directory is world writable. In some cases 
this can get rid of the need for setuid user scripts to 
write protected files. /tmp is a good example of where 
this is used. Setting the sticky bit is done as follows: 


chmod o+t directory 
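
The effect can be tried on a scratch directory. The following is a small sketch, assuming a POSIX shell; make_sticky_dir and has_sticky_bit are hypothetical helper names. It uses the octal mode 1777 (world-writable plus sticky, the same mode /tmp conventionally has) because numeric modes behave the same across chmod implementations.

```shell
# Sketch: create a world-writable directory with the sticky bit set,
# and test for the bit afterwards (hypothetical helper names).
make_sticky_dir() {
    mkdir -p "$1" && chmod 1777 "$1"
}

has_sticky_bit() {
    # -prune limits find to the directory itself;
    # -perm -1000 matches when the sticky bit is set.
    [ -n "$(find "$1" -prune -perm -1000 -print)" ]
}

# Usage:
#   make_sticky_dir /var/spool/shared && has_sticky_bit /var/spool/shared
```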


Setgid Directories 


Setting a directory setgid will cause files created 
there to belong to the same group as the directory 
rather than the user’s primary group. In some cases 
this can get rid of the need for privileged daemons that 
need to read things created by users. 


Combining this with the sticky bit is an effective 
way to create a drop-box location for a non-privileged 
daemon. Users can put things into this directory, but 
they can’t list the entries in the directory (since we 
won't add the directory read privilege), and they can 
only remove their own files (because we’ll set the 
sticky bit). 

nal programs for most handling, it is very difficult to protect 
yourself from all of them. Most modern UNIX-based operating 
systems do not permit setuid shell scripts. See the UNIX 
FAQ Question 4.7 at http://www.faqs.org/faqs/unix-faq/faq/part4 
for more information. 


Create a directory “drop” and set the sticky and 
setgid bits: 


# chown myservice:myservice drop 
# chmod u=rwx,g=rwxs,o=wxt drop 


This means that anyone can drop files into the drop 
directory, users can’t modify each other’s files and the 
myservice user doesn’t need any special privileges to 
manage files dropped into this directory. 
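
The drop-box setup can be sketched as a single helper. This is an illustration only: make_dropbox is a hypothetical name, and the octal mode 3773 is our numeric spelling of the symbolic mode shown above (setgid and sticky bits set; owner and group have full access; others may write and search but not read). The chown step is left as a comment because it needs root and a real service account.

```shell
# Sketch: create a drop-box directory (hypothetical helper name).
make_dropbox() {
    dir=$1
    mkdir -p "$dir" &&
        # 3773 = setgid (2000) + sticky (1000) + rwx,rwx,-wx (773)
        chmod 3773 "$dir"
    # As root, one would also run:  chown myservice:myservice "$dir"
}

# Usage:
#   make_dropbox /var/spool/myservice/drop
```

On most systems an unprivileged caller can only set the setgid bit if it belongs to the directory's group, so in practice this is run as root or as the service account itself.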


Web Applications 


By default, web applications run as the web user. 
On a system with multiple web applications, each 
application will have access to the others’ data in this 
configuration. By creating separate accounts for each 
application, they can then be separated using suexec, 
an Apache feature that causes CGI programs to run 
under user accounts rather than as the web server. This 
can protect application data from exploits against 
other CGI programs as well as the web server itself. 


Recommendations for the Future 


There are several features that would help large 
installations manage sudo and SSH better. Some of 
these are currently possible with custom work, but 
they should be integrated better into the products. 

• SSH should be able to get its authorized_keys 
information easily from LDAP or Active Directory 
rather than the user’s .ssh directory. This 
would allow the list of authorized keys to be 
protected from NFS attacks without resorting to 
local configuration files on each host (as 
described in “Hardening and Managing SSH”). 
To maintain least privilege, LDAP keys should 
be assignable to specific servers (rather than 
being globally accepted), and would need to 
allow restrictions such as “from” and 
“command”. The existing SSH LDAP solutions¹³ 
allow X.509 certificate authentication of users, 
but do not fully replace the authorized_keys file. 

• SSH needs better integration with one-time 
passwords (OTP). In particular, it should be possible 
to configure keys that allow an interactive shell 
to require one-time passwords while not 
requiring this for command keys. Interactive shells are 
extremely powerful and should always have a 
human available to enter the OTP. Command 
keys are very restricted and generally won’t 
have a human available to enter the OTP. This is 
difficult to implement with the current SSH 
tools, which generally require all-or-nothing 
use of OTP. 

• Sudo needs to be able to get sudoers 
configuration from LDAP. This would make it much 
easier to integrate with a Role Based Access 
Control (RBAC) system or other centralized account 
and authorization systems. It is currently 
possible to generate a sudoers file out of LDAP with 
custom tools, but this is cumbersome and creates 
a delay between when LDAP is updated and 
when the change takes effect. 

• A tool that would automatically determine trust 
relationships created by sudo and SSH and 
display this in a consolidated format (such as a 
directed graph) would be extremely valuable. 

• Sudo should use TTY tickets by default and 
optionally clean up old tickets automatically 
whenever sudo is run. 


¹³ LDAP is managed by the “Certificate Authentication” 
feature of commercial SSH and the “OpenSSH LDAP 
Public Key Patch” (http://ldappubkey.gcu-squad.org/) for 
OpenSSH. 


Availability 


OpenSSH is freely available under the BSD 
license from the OpenBSD project. Their website is 
available at http://www.openssh.org. At the time of 
this writing, the currently available version is 3.8. 


SSH Secure Shell, discussed in this paper, has 
been replaced by SSH Tectia. Both are commercial 
products available from SSH Communications Secu- 
rity (http://www.ssh.com). Where this paper refers to 
the commercial product, it is written to SSH Secure 
Shell version 3.2.9. At the time of this writing, the 
most recent version is SSH Tectia 4.1. 


Sudo is freely available and maintained by Todd 
Miller (Todd.Miller@courtesan.com) at http://www. 
courtesan.com/sudo. At the time of this writing, the 
most recent version is Sudo 1.6.7p5, though some fea- 
tures of the upcoming 1.6.8 are discussed in this paper. 


Conclusion 


In this paper we have established the importance 
of the principle of least privilege to the overall security 
of an environment, by reducing the avenues of attack 
and the extent that any particular attack can compro- 
mise the system as a whole. We have discussed prob- 
lems with the techniques that may currently be used in 
many environments including unrestricted SSH keys 
for automation tools and setuid tools with excessive 
privileges. Finally we have provided techniques and 
examples of how to apply least privilege to real-world 
automation problems, including restrictions on sudo 
and SSH, wrapper scripts, setgid and sticky directories. 
While these techniques are useful and important, even 
more important is the philosophy behind least privilege. 
By constantly asking ourselves what minimum set 
of privileges a particular operation needs, and 
challenging ourselves to reduce and compartmentalize those 
privileges, the security of our environments will not 
only improve, but become pervasive. 


ABSTRACT 


One ideal of configuration management is to specify only desired behavior in a high-level 
language, while an automatic configuration management system assures that behavior on an 
ongoing basis. We call a self-managing subsystem of this kind a closure. To better understand the 
nature of closures, we implemented an HTTP service closure on top of an Apache web server. 
While the procedure for building the server is imperative in nature, and the configuration language 
for the original server is declarative, the language for the closure must be transactional; i.e., based 
upon predictable and verifiable atomic changes in behavioral state. We study the desirable 
properties of such transactional configuration management languages, and conclude that these 
languages may well be the key to solving the change management problem for network 
configuration management. 


Introduction 


HTTP servers are complex applications. In craft- 
ing a valid configuration file for a server such as 
Apache, there are many options, often with cryptic 
names and unclear meanings. Choices for many 
options seem not to matter to the end-user, e.g., the 
exact locations of content hierarchies within the 
filesystem. Choices for other options have critical 
effects, such as whether to allow CGI programs within 
a particular directory to execute. Simple typos in the 
configuration file can be difficult to locate and have 
unpredictable results. Thus, to assure reliable service, 
many configuration changes must be made by an 
experienced system administrator. 


We manage an HTTP server cluster where the 
majority of responsible system administrators and 
content providers have historically been relatively 
inexperienced students. As a result, there has been 
considerable service downtime due to misconfigura- 
tion of the server, giving content files inappropriate 
names or MIME types, inappropriately protecting con- 
tent, and even allowing servers in the cluster to differ 
in configuration. 


Content providers often make serious errors in 
naming files and setting permissions for HTTP con- 
tent. Either content is protected too restrictively to be 
available, or content protections are permissive 
enough to pose a security risk. Users also have diffi- 
culties ensuring that files have extensions that match 
their content. Typical examples include inadvertently 
exposing private contents of scripts by giving them 
incorrect extensions or filing them in an inappropriate 
directory, or making content directories world- 
writable, thus posing a security risk. 


Naive editing of HTTP configuration files can 
also cause unexpected and costly downtime for web 
servers. When many virtual domains inhabit one 
server, one configuration error can be catastrophic, as 
it will bring down all of the domains. In practice, it 
can take a lot of time to recover from an error if 
changes have not been carefully tracked. At times, 
low-traffic virtual domains have been down for weeks 
due to undocumented changes that had hidden effects. 


An HTTP Closure 


We seek to solve this problem by surrounding 
our HTTP service with a closure, as described in [17]. 
A closure is a self-managing component of an other- 
wise open system. We intended our closure to: 


1. Allow relatively untrained users to reliably cre- 
ate relatively complex configurations involving 
virtual domains and aliases. 

2. Allow reliable creation and deletion of virtual 
domains. 

3. Ensure appropriate protections and MIME 
types for content. 

4. Protect against unauthorized changes to content 
or configuration. 

To accomplish this, we had to make a radical departure 
from the way HTTP servers are usually administered. 


At the beginning, we were inspired by several 
related projects. DryDock [20] is a content-management 
system that allows the submission of web pages 
to a web server, after they have been checked by a 
human being. While it provides some desired features, 
including content validity checking, DryDock is more of a 
content checking and approval method than a web 
server configuration tool. TemplateTree II [30] can 
help configure the webserver configuration file, by 
automatically filling in blanks in a pre-determined 
template, but seems to stop short of being able to 
handle advanced features such as virtual domains. Each of 
the virtual domains requires the addition of a new 
template to the file. Charlie [38] is a content-mapping 
webserver in which URLs are explicitly mapped to files by 
a declaration file, and service is not determined by 
filesystem structure. Our initial goal was to combine 
these ideas into a reasonable HTTP closure in which: 
1. Content is specified by mappings between 
URLs and files (Charlie). 
2. The appropriate configuration is generated 
from intermediate data (TemplateTree II). 
3. Content is checked for type validity before 
being published (DryDock). 


Our hope was that the resulting synthesis would 
be easier to use than any of its predecessors. 


Recently, others have attempted part of this 
process independently. The Virtualmin [9] environ- 
ment within the Webmin [8] web-based administrative 
environment solves the problem of defining virtual 
servers neatly, but does not deal with the problems of 
content management and assuring that content is pro- 
vided with correct MIME types, etc. Both Virtualmin 
and Linuxconf [19] support dynamically changing the 
modules that are loaded into Apache, though to our 
knowledge they do not address the dependency issues 
we discovered in trying to accomplish this. A third 
management environment, Comanche [32], was 
unavailable to us except in source form at time of 
writing, and we lack knowledge of its capabilities. 


Figure 1: Typical interactions between a 
user/administrator and web server, where arrows indicate 
information flow. 


HTTP servers are typically administered through 
direct access to configuration files and documents 
(Figure 1). One edits a server configuration file 
directly and then places content directly into directo- 
ries that the server should expose to the outside world. 
The configuration file serves not only as a means of 
control, but also as documentation of defaults and 
other performance characteristics of the web server. 
Without fairly complete knowledge of the contents of 
this configuration file, it can be difficult to publish 
content. Since the document repository is edited 
directly, users must also have a good grasp of file pro- 
tections within the server environment. Current 
approaches to this problem include simplified graphi- 
cal user interfaces that expose only part of the capabil- 
ities of the configuration file [9, 32]. 


The configuration of the Apache web server is 
described by the contents of several files (Figure 2), 
where arrows indicate couplings between data in differing
files or structures. In order for the server to
respond correctly to a request, several parts of the con- 
figuration must agree in intent. For example, in the 
figure, to answer the request for URI http://www.foo.edu/, 
it must be true that: 


1. www.foo.edu is a valid virtual domain.

2. www.foo.edu maps to the server’s address.

3. www.foo.edu’s content is stored in the directory
/some/where.

4. It is permissible to publish the content of the
directory /some/where.

5. The request is for a directory rather than a file.

6. In returning data for a directory, one first
checks for an “index file” that represents
directory content.

7. index.html is an appropriate file.

8. index.html exists in the directory /some/where.

9. index.html is readable to the web server.

10. index.html has MIME type text/html because of
the .html extension.


Figure 2: Configuration constraints required in order
to answer a request properly on an Apache web
server, where lines indicate required agreements
between data values.


If any of these assertions is not true, the request 
fails. In the figure, many of these constraints are indi- 
cated by lines between configuration data that must 
agree in value. 
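
The chain of agreements listed above can be sketched as a sequence of checks. The following Python sketch is illustrative only: the function name `failed_constraints`, the layout of the `config` dictionary, and the in-memory `files` mapping that stands in for the filesystem are all assumptions made for this example, not part of Apache or of the closure described here.

```python
import posixpath

def failed_constraints(host, config, files):
    """Return the violated constraints for a request for http://<host>/.

    config: hypothetical dict with keys 'vhosts' (host -> document root),
            'dns' (host -> address), 'server_address', 'publishable'
            (set of directories), 'index_names' (list of candidate names),
            and 'mime' (extension -> MIME type).
    files:  hypothetical dict mapping path -> {'readable': bool}.
    """
    failures = []
    root = config["vhosts"].get(host)
    if root is None:
        return ["not a valid virtual domain"]
    if config["dns"].get(host) != config["server_address"]:
        failures.append("name does not map to the server's address")
    if root not in config["publishable"]:
        failures.append("directory may not be published")
    # A request for a directory is answered with the first index file present.
    index = next((name for name in config["index_names"]
                  if posixpath.join(root, name) in files), None)
    if index is None:
        failures.append("no index file exists")
        return failures
    if not files[posixpath.join(root, index)]["readable"]:
        failures.append("index file is not readable by the server")
    extension = posixpath.splitext(index)[1]
    if config["mime"].get(extension) != "text/html":
        failures.append("index file does not have MIME type text/html")
    return failures

config = {
    "vhosts": {"www.foo.edu": "/some/where"},
    "dns": {"www.foo.edu": "192.0.2.10"},
    "server_address": "192.0.2.10",
    "publishable": {"/some/where"},
    "index_names": ["index.html"],
    "mime": {".html": "text/html"},
}
files = {"/some/where/index.html": {"readable": True}}
```

An empty result means every agreement holds; each string in a non-empty result names one broken link in the chain, which is exactly the information a novice lacks when a request simply fails.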


The net effect of this scheme of constraints is that
an Apache server can be quite difficult for a novice to
administer. There are several reasons for this:

1. The effect of a particular declaration depends 
on other declarations; one needs to understand 
the global configuration in order to understand 
the effect of a local declaration. 

2. The configuration language — in an attempt to 
be easy to type — is filled with seemingly con- 
venient defaults that make a configuration file 
difficult to interpret. 

3. Often several distributed declarations determine 
whether content is provided correctly. For 
example, in order to serve a directory, one must 
specify its protections as a directory, its mapping
as a URL, and its MIME type mapping (if
different from the default). 


To simplify this process, we took direct control
of the content directories and the configuration file
away from the user. These are instead controlled by an
intervening layer that mediates between user and server
(Figure 3).¹ The user interacts directly only with an image
of the document repository and a command inter- 
preter. This interpreter keeps track of its state and 
maintains a private document repository of its own to 
which a user does not have access. User commands 
cause copying from the user’s space into the closure’s 
space in order to publish a document. 


It might seem that we have just made the process 
of publishing more difficult, but in fact we have made 
it much less error-prone. At several steps during the 
publishing process, validity checks are made to the 
document and configuration requests. These checks 
include: 

1. Does the name of each virtual domain correctly 
map to a valid interface via DNS? 

2. Does each file’s MIME type roughly agree with 
its content, as indicated by file magic number- 
ing? 

3. Do HTML files contain correct HTML? 

Once a document passes these validity checks, it 
is published reliably, because the closure will take care 
of placing it in a proper location and protecting it so 
that the web server can see it. Also, a validate com- 
mand checks that the web server configuration and all 
content have not been edited by unauthorized people, 
by comparing cached MD5 checksums against checksums
of current data.
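
The checksum comparison performed by validate can be sketched as follows. This is a minimal illustration, not the closure's implementation: the helper names `checksum`, `snapshot`, and `validate`, and the use of an in-memory dictionary of file contents in place of the managed directory, are assumptions made for the example. Only `hashlib.md5` is a real library call.

```python
import hashlib

def checksum(data):
    """Return the hex MD5 digest of a bytes object."""
    return hashlib.md5(data).hexdigest()

def snapshot(contents):
    """Record a checksum for every managed path (path -> bytes)."""
    return {path: checksum(data) for path, data in contents.items()}

def validate(cached, contents):
    """Return the sorted paths that are missing or no longer match."""
    return sorted(
        path for path, expected in cached.items()
        if path not in contents or checksum(contents[path]) != expected
    )

# Hypothetical managed data: one configuration file and one document.
contents = {"httpd.conf": b"Listen 80\n", "index.html": b"<html></html>"}
cached = snapshot(contents)
```

A change made behind the closure's back, such as editing index.html directly, then shows up as a non-empty result from `validate(cached, contents)`. Note that this sketch reports only changed or missing files; detecting files added without authorization would require comparing the two sets of paths as well.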


This closure consists of three components: 

1. A setup script that initializes the closure on a 
prebuilt server. 

2. An agent that interprets the command language 
and makes changes in configuration over time. 

3. A command language for describing changes to 
make in service. 


¹ Some authors would call this “middleware.” “Middleware”
is perhaps one of the most abused terms in the modern
computing lexicon. As we provide not an API but instead
a message-passing interface, the term “middleware”
does not seem to apply accurately.
Experience in Implementing an HTTP Service Closure


We chose to place our closure around an Apache 
web server running inside RedHat Linux 9.0. We 
assumed that the underlying system would be newly 
built and functioning on the network prior to invoca- 
tion of the closure. In an actual closure, all systems 
with which the closure will interact must also be clo- 
sures, in order to maintain overall integrity. Since this 
was our very first closure (which, in hindsight, was 
probably too ambitious), we had to settle for interact- 
ing with an already functioning system. 


The initial “build script” determines a few 
aspects of pre-existing system configuration, such as 
where the Apache web server is located, whether cer- 
tain applications are installed on the system, and other 
vital data necessary in order to construct the closure. 
The script then creates a managed structure on the disk 
that has restricted permissions. This structure contains 
a startup/shutdown script for the web server, a docu- 
ment root, a (private) space for storing closure logs 
and data files, and a default configuration. Data files 
include definitions of virtual domains, access rights 
for users and directories, MD5 checksums of configu- 
ration and submitted files, and boilerplate configura- 
tion segments describing defaults. Internal data is 
stored in the Perl Data::Storable format. 


After the build script does its work, a command 
interpreter takes commands and modifies the resulting 
service automatically. This interpreter is a Perl script
that acts as a command line interface. This interface is 
responsible for all ongoing management of the HTTP 
service. In order to make the closure easy to use for 
both system administrators and end users (who may 
not be familiar with computer-related concepts), we 
decided to create a very limited command set that 
should be able to provide all the functionality neces- 
sary for a typical multiple-domain server. 


Most of this process is straightforward; difficul- 
ties arose mainly in designing an appropriate command 
language with which to converse with the closure. 
Coming up with an appropriate language took consid- 
erable thought and required nontrivial changes in the 
way we think about server configuration as a process. 


Command Language 1.X 


Figure 3: A closure mediates between the user and the
server configuration, reducing complexity of the
configuration process.


Our initial command language was patterned after the
structure of Apache’s configuration file httpd.conf. We
decided upon a minimal language with typical commands
like the following:


assert foo.bar.com


declares a virtual host and readies it to serve content.


retract foo.bar.com


removes a virtual host from the server, along with all
content that it provides.


post /home/prod http://foo.bar.com/products


makes the directory /home/prod on the current machine
appear to be the web directory http://foo.bar.com/products.


post /home/couch/foo.html http://foo.bar.com/foo.html


makes the file /home/couch/foo.html appear as the URL
http://foo.bar.com/foo.html. Each URL is associated with
a unique directory in a private managed space, and
each posted file or directory content is copied there to
isolate it from further changes by the user. The type of
the first argument (file or directory) determines what
post does. In the case of a directory, post copies all
subdirectories recursively.


retract http://foo.bar.com/products 


removes any association between that URL and a con- 
tent directory. It erases content previously associated 
with the URL via a post command. 


retract http://foo.bar.com/foo.html 


removes the mapping that results in content for the 
above URL. As described in [17], a web server is a 
mapping between URLs and content; the user need 
only specify that mapping and the closure takes over 
to assure it. 
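
The isolation that post provides can be illustrated with a small in-memory model. This is a sketch only: the class name `ClosureStore`, its methods, and the use of nested dictionaries in place of real directories are assumptions made for illustration; the closure described here copies files on disk.

```python
import copy

class ClosureStore:
    """Toy model of the closure's private space: one snapshot per URL."""

    def __init__(self):
        self._content = {}  # URL -> private copy of the posted content

    def post(self, source, url):
        # Deep-copy so that later edits to the user's copy do not leak through.
        self._content[url] = copy.deepcopy(source)

    def retract(self, url):
        # Remove the mapping together with the content posted to it.
        self._content.pop(url, None)

    def serve(self, url):
        # Return the published content, or None if nothing is mapped.
        return self._content.get(url)

user_dir = {"index.html": "<h1>Products</h1>", "img": {"logo.png": b"\x89PNG"}}
store = ClosureStore()
store.post(user_dir, "http://foo.bar.com/products")
user_dir["index.html"] = "half-finished edit"  # the user keeps working
```

After the last line, the published copy still holds the content as it was when posted; the user's unfinished edit becomes visible only after another post.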


Ambiguity

At this point, progress on the project ground to a 
halt due to seemingly insurmountable difficulties 
within the command language. Ambiguity arose in the 
command language as a direct effect of defaults in the 
underlying httpd.conf, as well as default behavior for 
the corresponding HTTP protocol. This ambiguity has 
several effects: 

1. The causal effect of many commands is unclear 
or determined by context. 

2. One needs to understand the history of all com- 
mands in effect to know exactly what a single 
command will do. 

3. It is possible for one command to override part 
of another, so that the resulting state cannot be 
reached via any other sequence. 

As the simplest example, consider:


post /home/foo.html to http://www.foo.com


Should this make /home/foo.html available as
http://www.foo.com/foo.html or as http://www.foo.com? In
the latter case, should the file /home/foo.html be
renamed to index.html or left alone in the directory?
Then consider:


post /home/foo to http://www.foo.com


If /home/foo is a file, this has a potentially different
effect than if /home/foo is a directory. But we cannot
know which it is from the command itself. Finally,
consider:


post /home/index.html to http://www.foo.com/index.html

post /home/index.cgi to http://www.foo.com/index.cgi


Which of these will be the directory index? The result of
answering these questions in any reasonable way was that
the effects of the seemingly simple command language
were hideously complex to document and understand.
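
The last example can be made concrete. In Apache, the DirectoryIndex directive lists candidate index names in order of preference, so the answer depends on a setting that the two post commands never mention. The sketch below models only that selection rule; the function name `choose_index` and the two preference lists are assumptions for illustration.

```python
def choose_index(posted_names, index_order):
    """Return the first name in index_order that has been posted, else None."""
    for name in index_order:
        if name in posted_names:
            return name
    return None

posted = {"index.html", "index.cgi"}

# The same two posts yield different directory indexes under different
# (invisible) server defaults.
print(choose_index(posted, ["index.html", "index.cgi"]))  # index.html
print(choose_index(posted, ["index.cgi", "index.html"]))  # index.cgi
```

The user issuing the post commands sees neither preference list, which is why the effect of the commands cannot be determined from the commands alone.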


Creating content for a web server requires com- 
plete knowledge of its conventions, including which 
filenames have special meanings or interpretations. 
These defaults are set in httpd.conf. Without complete 
knowledge of the defaults, the user cannot create content 
properly. By abstracting the defaults into a command 
language rather than a file, we made the defaults invisi- 
ble, and thus rendered the problem more difficult than 
before. We concluded that we had not simplified the 
problem of managing httpd.conf; we had actually made 
the management process more difficult than before! 


Appropriate Closure Language 


The key to these quandaries was to carefully 
describe desirable attributes of the command language 
and then redesign to these requirements. But we did 
not understand the optimal properties of such a com- 
mand language, and only had the httpd.conf format as 
an example. Its properties include: 

1. Minimization of ink: all that is unspecified has
(reasonable) defaults.

2. Hierarchy: everything is laid out in a carefully 
designed multi-level hierarchy. 

3. Scoping: the intent of commands depends upon 
the context in which they are entered. 

4. Ordering: ordering of certain commands, 
including protections, changes intent within the 
configuration file. 


These properties ease the loop of interacting 
directly with the configuration file, but do not ease the 
process of incrementally describing a configuration 
through individual and atomic commands. The design 
of httpd.conf presumes that the administrator has global 
knowledge of the contents of the whole configuration 
file. Our closure users, operating from outside the clo- 
sure, have no such knowledge. 


In effect, the syntax of httpd.conf had corrupted 
our thinking. Used to being able to look at the whole 
file to answer questions, we presumed that our com- 
mands could be patterned after the edits we make to 
the file and the file copying that we would do without 
the closure in place. This patterning made configura- 
tion more difficult rather than easier. In fact, the exact 
conveniences and defaults — that make httpd.conf easy 
to use when one is editing it directly — make a transac- 
tional language difficult to use. 
To be easy to use, our command language has to
have several somewhat different properties:

1. Clarity: the intent of a command should be 
immediately clear from its form. 

2. Independence: to the extent possible, com- 
mands should be independent of one another to 
avoid conflicts, ambiguities, and difficulties in 
determining effects. In particular, global 
defaults should not be subject to change. 

3. Declarative syntax: Each command represents a
state to preserve, rather than an action to perform.

a. Commands should be idempotent, i.e.,
repeating a command twice in a row
should have the same behavioral effect as
doing it once.

b. Commands should be stateless, i.e., the
behavioral effect of doing a command
should not depend upon prior executions
of the same command, even if these
executions occurred in the remote past. A
stateless command is always idempotent;
statelessness is a stronger condition.

c. Reducibility to assertions: Either an
assertion is in effect or not; there is no
such thing as being “1/2 in effect.” A
command that conflicts with a previous
command undoes (“retracts”) the effect of
the conflicting command. Equivalently, any
sequence of assertions and retractions is
equivalent to a subsequence consisting of
assertions alone.

4. Representability: at any time, one should be 
able to get an idea of all commands currently in 
effect, to understand global contents. Ideally, 
the representation should be a conflict-free 
description of the current state of the service, in 
terms of the unordered list of commands cur- 
rently in effect. In particular, the representation 
of service should be free of retractions. 


These properties actually arise from mathemati- 
cal models that we will describe later. For now, it suf- 
fices to mention that statelessness and reducibility to 
assertions imply representability. The remainder of the 
requirements, clarity and independence, contribute to 
ease of use. 


Statelessness means that a command does not 
depend upon prior invocations of itself to do its work. 
Stateless commands cannot be incremental in nature, 
but must deal with absolute quantities. For example, 
incrementing a counter is not a stateless command. 
The reason that statelessness is important is that the 
user may not have knowledge of prior commands or 
pre-existing configuration. While a stateful command 
may have indeterminate results, a stateless command 
has (roughly) the same effect regardless of when it is 
executed. The user need not remember anything in 
order to know what its effect is. 
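
The difference between a stateful and a stateless command can be shown in a few lines. The sketch below is illustrative; the `quota_mb` setting and both function names are hypothetical and do not correspond to commands in the closure.

```python
def increment_quota(config, amount):
    """Stateful: the result depends on how often this has run before."""
    updated = dict(config)
    updated["quota_mb"] = updated.get("quota_mb", 0) + amount
    return updated

def set_quota(config, amount):
    """Stateless: the result is the same whenever, and however often, it runs."""
    updated = dict(config)
    updated["quota_mb"] = amount
    return updated

base = {"quota_mb": 100}
once = set_quota(base, 500)
twice = set_quota(set_quota(base, 500), 500)
```

Here `once` and `twice` are equal, and both are independent of the value in `base`. Applying `increment_quota` twice, by contrast, gives a different result than applying it once, and the user cannot predict either result without knowing the prior configuration.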


Representability means that at any time, one can 
describe the closure via the commands that are currently 
in effect, which is typically a smaller list than the whole 
sequence of commands since the closure was created.
Reducibility to assertions means in addition that the 
commands currently in effect do not have to contain 
retract statements, because conflicts cause conflicting 
assertions to completely disappear from a representation. 
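
One way to picture representability and reducibility to assertions is as a reduction of the full command history to the assertions still in effect. The sketch below assumes a simplified model in which every command names a key (for example, a URL or a domain) and two commands conflict exactly when they share a key; the function name and the tuple layout are assumptions for this example.

```python
def reduce_to_assertions(history):
    """Collapse a command history into the assertions currently in effect.

    history: list of (verb, key, value) tuples, where verb is "assert" or
    "retract", key identifies what is being described, and value is the
    asserted content (ignored for retractions).
    """
    in_effect = {}
    for verb, key, value in history:
        if verb == "assert":
            in_effect[key] = value    # a conflicting assertion replaces the old one
        elif verb == "retract":
            in_effect.pop(key, None)  # a retraction leaves no trace behind
        else:
            raise ValueError(f"unknown command: {verb}")
    # The representation is unordered; sorting by key just makes it stable.
    return [("assert", key, value) for key, value in sorted(in_effect.items())]

history = [
    ("assert", "http://foo.bar.com/products", "/home/prod"),
    ("assert", "http://foo.bar.com/foo.html", "/home/couch/foo.html"),
    ("retract", "http://foo.bar.com/foo.html", None),
    ("assert", "http://foo.bar.com/products", "/home/prod2"),
]
representation = reduce_to_assertions(history)
```

The resulting representation contains a single assertion and no retractions, and reducing the representation again returns it unchanged.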


Command Language 2.X and Beyond 


These considerations caused subtle but profound 
changes in our command language that both resolve 
ambiguities and make it easier to use than making 
manual changes to configuration files and web con- 
tent. We are currently in the process of implementing 
these changes. 

1. Commands either augment or retract the effects 
of other commands. The effect of issuing a con- 
flicting command is to retract the commands 
with which it conflicts. To assure this, we dep- 
recated using post for files (except for indexes), 
and required its use on directories, where it 
recursively applies to subdirectories. In version 
1, retracting a directory does not retract its sub- 
directories; in version 2, subdirectories are 
retracted as well. 

2. When overriding the MIME type for a specific 
file, the command does not affect other files 
with the same extension. In version 1, a MIME- 
type override applied to all of the files of that 
type in a folder. This caused confusing changes 
in default MIME-types when new files were 
uploaded. To assure clarity of intent, the over- 
ride is now specific to each single file. 
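
The version 2 rule for MIME overrides can be sketched as a lookup in which a per-file override takes precedence and never affects other files. The default table, the paths, and the function name below are hypothetical.

```python
import posixpath

# Hypothetical extension defaults; a real server reads these from its configuration.
DEFAULT_TYPES = {".html": "text/html", ".txt": "text/plain"}

def mime_type(path, overrides):
    """Return the MIME type for path: a per-file override wins, else the default."""
    if path in overrides:
        return overrides[path]
    extension = posixpath.splitext(path)[1]
    return DEFAULT_TYPES.get(extension, "application/octet-stream")

# The override names exactly one file, so other .txt files are unaffected.
overrides = {"/docs/readme.txt": "text/html"}
```

Because the override is keyed by full path rather than by extension or folder, uploading a new .txt file to /docs cannot change the type under which existing files are served.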


Other profound changes are yet to be imple- 
mented. The above ideals for language imply that the 
indexing process for a directory should be independent 
of its content. One should be able to specify an index 
for an empty directory, or have a directory with no 
index. In the latter case, one gets a “permission 
denied” error instead of a directory listing. This effect 
is accomplished by adding the special keyword index 
to a post command for the index file. 


Thus we plan to change the concept of index
from being a file with the word index as its name to
being a file that is explicitly bound to a directory as
its listing operation. This index file can have any
name and be of any file type. This change provides a
significant increase in flexibility over the old style,
while removing one of the most frustrating ambiguities
in ensuring consistent operation. If no index file is
selected, then either an error page is returned by
default, or the security on the system can be relaxed
to allow the contents of the requested directory to be
displayed.

Critique 

Our current closure does its intended job well, 
but there are many shortcomings. While several are 
simply new features to be implemented, some require 
a relatively deep rethinking of how we interact with 
systems that provide web content. 
First, while the intent of a configuration can be
represented by a list of commands, repeating the com- 
mands will not produce the exact same HTTP service. 
There is no guarantee that the source directories have 
not changed in the meantime. Repeating the same set 
of commands that created the service will instead 
result in an updated service based upon changes in the 
source directories. Indeed, what we seem to have 
implemented is the opposite of a content-staging sys- 
tem such as DryDock; we included no protection 
against inadvertent changes of the source repository. 


Many issues for dynamic content have yet to be 
resolved. Various programs, such as Gallery [39], cre- 
ate and/or modify files within the web hierarchy by 
themselves, which constitutes performing operations 
outside of the closure. This kind of interaction is not 
supported in our model, and though it is tolerated, per- 
haps should not be allowed at all. 


The closure deals very poorly with dynamic con- 
tent stored within the document repository. Repeating 
commands used to set up a repository will erase any 
dynamic content created in the meantime within the 
repository by the action of CGIs. This content is not 
even accessible to the user unless exposed by the web 
server, and any files posted via the closure will be 
automatically made read-only. 


It could be argued that this fascism is not a bug, 
but a feature; it strictly enforces utilizing external 
databases to store dynamic content rather than local 
server files. This is indeed the “best practice” for 
managing dynamic content, according to many web 
programmers. 


In fact, we can find no reasonable solution to 
supporting this behavior of CGIs. If we allow dynamic 
content in document directories, it must be protected 
from subsequent post commands (that, in normal oper- 
ation, will delete that content). But if we ignore such 
files, then the effect of post is not stateless. 


Ideally, CGIs should use external data sources 
for dynamic content. The couplings between CGIs and 
external data sources should be managed by the clo- 
sure itself, though the best language for accomplishing 
this is unknown. Another rather deep question is how 
to handle enforcing integrity constraints for data 
sources outside the closure, such as databases utilized 
by CGI scripts. If we do force all programmers to uti- 
lize databases, how do we assure that their scripts bind 
to functional and allowed data sources? While the 
Microsoft .NET framework makes this kind of check- 
ing easier by separating data source bindings from 
programs, no such solution exists for Linux and 
Apache. 


Finally, this closure was our very first closure, 
and thus had no other closures with which to converse 
(unlike the ideal web service closure described in 
[17]). Thus our closure must check for external depen- 
dencies itself, and cannot correct deficiencies that it 
finds in its environment. For example, the assert command
checks whether the domain being asserted
points to this machine in name service, but cannot 
assure that by modifying the name server. Instead, it 
must refuse to perform the assert. 


Future Work 


Plenty of additional work needs to be done on
this prototype before it is appropriate for production
use. Among the most obvious extensions is the ability
to handle SSL (HTTPS) traffic as well as HTTP. This
will require improving how the closure deals with
constraints. One tricky part of handling SSL, for
example, is that one must enforce the constraint that only
one virtual server bound to each address can have SSL
capabilities. Currently, one must explicitly disable one
SSL instance before asserting another.
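
The constraint described above, together with the current requirement to disable one instance explicitly before asserting another, can be sketched as follows. The function names, the exception class, and the addresses are assumptions for illustration, not commands of the closure.

```python
class SSLConflict(Exception):
    """Raised when a second virtual server requests SSL on a bound address."""

def assert_ssl(ssl_by_address, address, host):
    """Record host as the SSL virtual server for address, refusing conflicts."""
    current = ssl_by_address.get(address)
    if current is not None and current != host:
        raise SSLConflict(f"{current} already provides SSL on {address}")
    ssl_by_address[address] = host  # repeating the same assertion is harmless

def retract_ssl(ssl_by_address, address):
    """Disable SSL on address, if any virtual server had it enabled."""
    ssl_by_address.pop(address, None)

bindings = {}
assert_ssl(bindings, "192.0.2.10", "www.foo.com")
```

Under the reducibility principle discussed earlier, a later version might instead let the new assertion retract the conflicting one automatically; this sketch deliberately mirrors the current behavior, in which the conflict is refused.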


Separating indexing from directory contents is a 
bit tricky, as Apache is designed to couple them 
together. Currently the closure simply uses whatever 
index file is present in each directory. Implementing 
the ideal indexing scheme — in which indexing is com- 
pletely separate from directory contents — requires 
special care to avoid naming conflicts. This can be
handled by storing the index under a name that will not
be used otherwise (and cannot be easily entered as a
URL), such as “-->index<--.__html__”. To avoid confusion
in choosing an extension matching the MIME
type of the index file, this file is a simple CGI script
that reads the real index content from a second
cryptic filename, “-->content<--.__html__”, and displays
the contents, tagged with the appropriate MIME
header.


There are also problems in dynamically manag- 
ing the modules that Apache loads and requires. A 
true closure would need to analyze what the user 
desires from the web server and load the necessary 
modules by creating a dynamically generated list. We 
thought we could look into the directory where the 
modules are located and instruct Apache to load every 
one it finds as a starting step. We discovered quickly 
that certain modules are dependent upon others, that 
the order in which these modules are loaded is impor- 
tant, and that some modules conflict with and preclude 
the use of others. For example, mod_proxy_ftp.so 
requires mod_proxy.so to be loaded first or else loading 
will fail; likewise loading mod_dav_fs.so will fail if 
mod_dav.so is not loaded first. Using the current 
scheme for loading modules, there is no way to know 
in advance what dependencies, if any, a module actu- 
ally has without trying to load it. 
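
If the dependencies were known, computing a safe load order would be a topological sort. The sketch below assumes a hand-maintained dependency table, which is exactly the information the text says Apache does not expose in advance; only the two dependencies named above come from our experience, and the table name `REQUIRES` and function name `load_order` are hypothetical. It uses the standard `graphlib` module (Python 3.9 or later).

```python
from graphlib import TopologicalSorter

# Hypothetical, hand-maintained table: module -> modules that must load first.
REQUIRES = {
    "mod_proxy.so": [],
    "mod_proxy_ftp.so": ["mod_proxy.so"],
    "mod_dav.so": [],
    "mod_dav_fs.so": ["mod_dav.so"],
}

def load_order(wanted, requires=REQUIRES):
    """Return the wanted modules plus prerequisites, prerequisites first."""
    needed = {}
    stack = list(wanted)
    while stack:
        module = stack.pop()
        if module not in needed:
            needed[module] = list(requires.get(module, []))
            stack.extend(needed[module])  # pull in prerequisites transitively
    # static_order() yields each node only after all of its predecessors.
    return list(TopologicalSorter(needed).static_order())

order = load_order(["mod_dav_fs.so", "mod_proxy_ftp.so"])
```

Asking for only the two dependent modules yields all four, with mod_dav.so ahead of mod_dav_fs.so and mod_proxy.so ahead of mod_proxy_ftp.so. Conflicts between modules, which the text also mentions, are not modeled here.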


As a temporary solution, the list of modules to be 
loaded had to be made static, along with the order in 
which they are loaded. However, this means that the 
functionality of the closure is currently severely lim- 
ited, as there are no commands to the closure that can 
expand the capabilities of the server. 


Another ongoing issue is rights management. 
Currently, a user can only be granted rights to edit a 
domain. In addition, we should be able to permit users
to update sections of a domain without having rights
to other sections. For example, we have started writing
code that will allow couch to edit
http://www.foobar.com/research, but restrict his access
to other areas of the domain.
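
Section-level rights of this kind amount to a prefix test on URL paths. The following sketch is one possible formulation, not the code under development: the `grants` table and the function `may_edit` are hypothetical, and matching is done on whole path components so that a grant on /research does not extend to /researchers.

```python
from urllib.parse import urlsplit

def path_parts(url):
    """Split a URL's path into its non-empty components."""
    return [part for part in urlsplit(url).path.split("/") if part]

def may_edit(grants, user, url):
    """True if the user holds a grant on the same host covering url's path."""
    target_host = urlsplit(url).netloc
    target = path_parts(url)
    for granted in grants.get(user, []):
        prefix = path_parts(granted)
        same_host = urlsplit(granted).netloc == target_host
        if same_host and target[:len(prefix)] == prefix:
            return True
    return False

grants = {"couch": ["http://www.foobar.com/research"]}
```

With this table, couch may edit anything under /research on www.foobar.com and nothing else. A grant whose path is empty covers the whole domain, which corresponds to the rights that can be granted today.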


Other low-level issues, such as ensuring that there
is adequate disk space for web content, need to be 
addressed, though this may need to be performed by 
talking to a disk closure that has yet to be written. 
Since varying devices have different access speeds, 
this could be a future concern, as some content may 
need to be delivered at a much higher quality of ser- 
vice (QoS) than other content, and one would want 
those pages to be stored on faster access devices. 


Lastly, we would like the application to be able 
to handle not only a stand-alone web server; it should 
also be able to scale to a grid or cluster type configura- 
tion, which brings a whole new level of difficulties 
and questions, but ultimately a far more powerful and 
desirable closure. 


Theoretical Background 


While we were struggling with the implementa- 
tion of the prototype closure, another struggle was 
going on at a different level. Clearly, our language 
evolved into something relatively useful from some- 
thing relatively useless. But why do the above lan- 
guage principles work, and what mathematics under- 
lies the design decisions we made? In this section, we 
explore some of the mathematical underpinnings of 
closure language design, and tie this work into other 
work on languages for configuration management. For 
the non-mathematically inclined, this section can be 
skipped without loss of continuity. 


An overview of the results of this section is 
shown in Figure 4. Statelessness of individual com- 
mands leads to idempotence of sequences of com- 
mands. The ability to remove retractions of commands 
from a sequence and retain equivalent effect is called 
“reducibility to assertions.” This property, in combi- 
nation with a semantic model that determines pre- 
ferred order of operations, makes a sequence of com- 
mands declarative in character. Statelessness of com- 
mands, reducibility to declarations, and a one-one cor- 
respondence between configurations and behaviors 
give rise to Cfengine-like convergence of the declara- 
tions, thought of as an operator upon configuration. 


There is currently much controversy about
whether host configuration languages should be
imperative [24, 34, 35] or declarative [1, 4, 5, 6, 7, 10,
22, 23, 26, 27, 33]. A subset of a language is
“imperative” when it describes procedure or process: “what
should be done” as an interpretable set of instructions.
A subset of a language is “declarative” when it
describes “what the result should be” without specifying
the method or procedure with which this result is
accomplished. For example, saying “the car must be
blue” is a declarative statement, while “paint the car
blue” is an imperative procedure for assuring the truth
of the declarative statement. A particular language can
exhibit both properties, specifying some things
imperatively and others declaratively.





Figure 4: Map of theoretical concepts and their
relationships, where arrows indicate logical
implication.





Tools support and encourage either imperative or 
declarative thinking. Proponents of the “imperative” 
tools point out that specifications for these tools are 
close to the way a human would manually configure a 
system, while converting human instructions to declar- 
ative language requires some reverse-engineering 
[24]. Proponents of the “declarative” tools point out 
that mechanism is not important; one should specify 
results, not mechanism, and specifying anything more 
limits flexible response of the configuration manage- 
ment tool to changing requirements [1, 17]. In prac- 
tice, a majority of seemingly declarative languages for 
configuration management allow the direct execution 
of imperative code as an option when declarative 
mechanisms fail to be expressive enough [4, 5, 6, 7,
10, 13, 22, 23, 33].


While imperative and declarative mechanisms 
both work well for creating an initial configuration, 
neither imperative nor declarative mechanisms proves 
sufficient to implement a closure as described in [17]. 
In both paradigms, there are serious problems in deal- 
ing with changes in intent over time. Imperative man- 
agement mechanisms suffer from script complexity; it 
is difficult to make changes in these “build scripts” 
without mistakes [13]. Likewise, undisciplined 
changes to declarative specifications can lead to unin- 
tentional heterogeneity within large networks [17]. 

Shortcomings of Imperative Scripts


Imperative mechanisms substitute order and 
reproducibility for understanding of internal depen- 
dencies. The script that builds a host is constructed 
through intensive validation of its behavioral effects, 
but what the script actually does typically remains 
poorly understood. This leads to a change control 
problem as the script is reused over a long lifecycle. 
Due to lack of understanding of internal dependencies, 
such scripts can only be safely modified by adding 
stanzas to the end [24]. Re-imaging a host requires 
cycling through all of its historical states, including all 
errors in configuration made by previous scripts. The 
only alternative is to start over from scratch and vali- 
date a new script from the beginning. 


The reason that “order matters” within the
imperative paradigm is that the implicit preconditions 
for each stanza of a script are that all prior stanzas 
have been executed in order. ISConf enforces this 
order for hosts that miss an update by resuming stanza 
execution exactly where the host last stopped execut- 
ing them, executing missed stanzas before new ones. 
In this way, the host passes through a sequence of 
reproducible states, so that a final desirable state is 
assured. If ISConf allowed hosts to skip stanzas, 
scripts would break unpredictably, because the scripts’ 
preconditions would not be assured on hosts on which 
stanzas were skipped. 
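
The resume-where-you-stopped discipline can be sketched with a per-host journal. This is an illustration of the idea as described above, not ISConf's actual implementation; the stanza functions, the journal format, and the name `run_pending` are assumptions.

```python
def run_pending(stanzas, journal, host_state):
    """Run, in order, every stanza this host has not yet executed.

    stanzas:    ordered list of (name, function) pairs; each function
                mutates host_state in place.
    journal:    names of the stanzas already executed on this host, in order.
    """
    for name, action in stanzas[len(journal):]:
        action(host_state)
        journal.append(name)
    return host_state

def install(state):
    state["httpd"] = "1.0"

def patch(state):
    # Implicit precondition: install has already run on this host.
    state["httpd"] = state["httpd"] + "-p1"

def enable(state):
    state["enabled"] = True

STANZAS = [("install", install), ("patch", patch), ("enable", enable)]
```

A freshly built host (empty journal) and a host that was down while patch and enable were issued (journal containing only install) both reach the same final state, because the late host executes the stanzas it missed before anything newer. Skipping patch on a host where install never ran would fail, which is the breakage the ordering prevents.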


Shortcomings of Declarative Languages 


Further, careless use of declarative tools can lead 
to exactly the kind of unpredictable heterogeneity that 
the imperative tools like ISConf are designed to avoid. 
Consider configuration elements $A$, $B$, and $C$ with
initial values $A = a$, $B = b$, $C = c$ and configuration
files (declarations) $d_A$, $d_B$, $d_C$. Suppose that

1. $d_A$ sets $A = a'$ and leaves all else alone.
2. $d_B$ sets $B = b'$ and leaves all else alone.
3. $d_C$ sets $C = c'$ and leaves all else alone.


Suppose that at any time, a distinct subset of
hosts is down (unreachable). At the end of applying
$d_A$, $d_B$, $d_C$ in sequence, there are now eight kinds
of hosts in the network:

1. $A = a'$, $B = b'$, $C = c'$: Up during $d_A$, $d_B$, $d_C$.

2. $A = a$, $B = b'$, $C = c'$: Down during $d_A$; up
during $d_B$, $d_C$.

3. $A = a'$, $B = b$, $C = c'$: Down during $d_B$; up
during $d_A$, $d_C$.

4. $A = a$, $B = b$, $C = c'$: Down during $d_A$, $d_B$; up
during $d_C$.

5. $A = a'$, $B = b'$, $C = c$: Down during $d_C$; up
during $d_A$, $d_B$.

6. $A = a$, $B = b'$, $C = c$: Down during $d_A$, $d_C$; up
during $d_B$.

7. $A = a'$, $B = b$, $C = c$: Down during $d_B$, $d_C$; up
during $d_A$.

8. $A = a$, $B = b$, $C = c$: Down during $d_A$, $d_B$, $d_C$.


As time goes on, the unintentional heterogeneity 
gets worse, a factor of two at a time, every time a sta- 
tion is unavailable for an update. 
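
The doubling can be checked by brute force. The following Python sketch enumerates every pattern of hosts being up or down during the three declarations and counts the distinct resulting configurations. The data layout and the name `final_state` are illustrative assumptions, not part of any tool discussed here.

```python
from itertools import product

initial = {"A": "a", "B": "b", "C": "c"}
# Each declaration sets exactly one element and leaves all else alone.
declarations = [("A", "a'"), ("B", "b'"), ("C", "c'")]  # d_A, d_B, d_C

def final_state(up_pattern):
    """Configuration of a host that was up (True) or down (False) for each declaration."""
    state = dict(initial)
    for (name, value), is_up in zip(declarations, up_pattern):
        if is_up:
            state[name] = value
    return tuple(sorted(state.items()))

kinds = {final_state(pattern) for pattern in product([True, False], repeat=3)}
print(len(kinds))  # 8
```

Adding a fourth independent declaration to `declarations` (with `repeat=4`) yields 16 kinds, which is the factor of two per missed update described above.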
Two Principles


The above observations can be summarized as 
two related principles of configuration management 
that apply to any such process: 


Principle 1 Once one controls or manages a
thing, one cannot forget over time that it is
controlled or managed.²


For example, “forgetting” that A is managed above 
leads to a heterogeneous population of hosts with two 
differing values of A [18]. More generally, 


Principle 2 The discipline with which one 
changes a declarative configuration file is as 
important to effective configuration management 
as the accuracy with which the file expresses 
intent. 


Tools that generate the whole configuration for each 
host each time [1, 2, 10, 22, 23, 26, 27, 33] neatly 
avoid this problem, at the cost of being somewhat lim- 
ited in scope and unable to handle large changes such 
as software subsystem installation and removal. 


Transactional Languages 


With these goals in mind, we defined a new kind 
of configuration language that has both imperative and 
declarative aspects. 


Definition 1 A transactional configuration lan- 
guage is one in which configuration is expressed 
as a sequence of atomic (indivisible) changes in 
behavior from a given and known base state. 


For purposes of analysis, a transactional language has
at least two basic primitives, assert and retract.³ The
command


assert {behavior} 
causes a behavior to be exhibited, while 
retract {behavior} 


causes the behavior to become absent. This choice of 
primitives is arbitrary but allows us to discuss several 
effects of transactional language easily. The transac- 
tions might as well be SQL queries into databases or 
even XQUERYs into XML. 


A transactional language has somewhat of an 
imperative quality to it, because order sometimes mat- 
ters, e.g., the order of assert and retract for the same 
behavior determines whether that behavior is present. 
The key to our argument and work is that it is also 
possible — by design — to give the transactional lan- 
guage a declarative flavor as well. 
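
To make the two primitives concrete, here is a minimal Python sketch that models a configuration as a set of behaviors and applies a sequence of assert and retract transactions to a known base state. The tuple encoding and the function name `apply_transactions` are assumptions made for illustration; the paper does not prescribe a representation.

```python
def apply_transactions(transactions, base=frozenset()):
    """Apply (verb, behavior) pairs in order, starting from a known base state."""
    behaviors = set(base)
    for verb, behavior in transactions:
        if verb == "assert":
            behaviors.add(behavior)        # behavior becomes exhibited
        elif verb == "retract":
            behaviors.discard(behavior)    # behavior becomes absent
        else:
            raise ValueError(f"unknown primitive: {verb}")
    return behaviors

# Order matters for assert and retract of the same behavior.
first = apply_transactions([("assert", "indexing"), ("retract", "indexing")])
second = apply_transactions([("retract", "indexing"), ("assert", "indexing")])
```

Here `first` is empty while `second` contains `"indexing"`, which is the imperative quality noted above.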


Reducibility to Assertions 


At present, our language has no constraints; most 
any kind of assert and retract statements are allowed. 
Our next job is to make it possible to simplify com- 
plex command sequences. 


² “Be careful what you command, my son. A command,
once given, must be repeated forever.” (Duke Leto
Atreides, in Frank Herbert’s Dune)

³ Any resemblance to the primitives with the same names in
the programming language Prolog is purely intentional.
Definition 2 A transactional language L containing
only assert and retract statements is reducible
to assertions if for any sequence of assert and
retract statements, there is a subsequence of assert
statements alone that has the exact same behavioral
effect.


Reducibility means that an assertion cannot apply
“halfway.” If there is a state in which a retraction
cancels part of an assertion, then the assertion is “half
right.” In a reducible language, retractions cancel
assertions either fully or not at all.


As an example of a non-reducible language, sup- 
pose that directory indexing is turned on by default 
and one must manually retract the behavior after 
asserting the contents of the directory. To make this 
language reducible to assertions, indexing instead 
must be off by default. 


Reducibility has a subtle but important effect 
upon language. If a language is reducible, the results 
of any set of transactions can be expressed with “positive
language”: what should happen, without mention
of what should not happen. Since retractions are order- 
dependent, but assertions typically are not, making a 
language reducible to assertions has the primary effect 
that one can express behavioral outcomes in largely 
order-independent fashion. 


Reducibility to Declarations 


The ability to eliminate retractions from a lan- 
guage is just one kind of reducibility: 


Definition 3 Let L be a transactional language
and let
$P \subseteq \{(s_a, s_b) \mid s_a, s_b \in L\}$
be a partial order on elements of L, where
$(s_a, s_b) \in P$ exactly when $s_a$ must precede $s_b$.
Then L is reducible to declarations if for every
sequence of transactions $(t_1, \ldots, t_n)$, there is a
subset $D = \{d_1, \ldots, d_k\} \subseteq \{t_1, \ldots, t_n\}$ of the set
of transactions (where duplicates are eliminated),
where for every total ordering $(e_1, \ldots, e_k)$ of D
consistent with the partial order P, the behavioral
result of applying the sequence $(e_1, \ldots, e_k)$
is the same as that of applying the sequence
$(t_1, \ldots, t_n)$.

This is a complex and perhaps overly wordy way of 
expressing a relatively simple idea. A transactional 
language is reducible to declarations if for every 
sequence of transactions in the language, there is a 
subset that does the same thing in any reasonable 
order in which it is applied. In writing down this subset,
“order does not matter” because we already know
the partial order P describing how to appropriately 
order execution of the particular transactions within 
the subset. 


For example, suppose that we have the following 
transactions: 
assert A
assert A.X
assert B
assert B.Y
retract B

and the partial order:

{ (assert A, assert A.X),
(assert B, assert B.Y) }


meaning that one must create A or B before creating 
their substructures A.X or B.Y. Suppose that retract B 
retracts the substructure as well: this is allowed. Then 
we can write an equivalent set of operations as the set 
{ assert A.X, assert A } where the order of this set is 
unimportant, because we know that assert A.X must 
follow assert A from the partial order. In our closure, 
the order constraints are that one must post the con- 
tents of parent directories before posting subdirecto- 
ries; the effect is identical to that in this example. 
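
The example can be replayed mechanically. The sketch below, in Python, models the closure as a set of names, lets retract remove a node together with its substructure (as the text allows), and checks that every ordering of the reduced set that respects the partial order gives the same result as the full sequence. The helper names `run` and `consistent` and the dotted-name convention for substructure are assumptions for illustration.

```python
from itertools import permutations

def run(transactions):
    """Apply (verb, name) pairs in order; asserting X.Y requires X to exist."""
    present = set()
    for verb, name in transactions:
        if verb == "assert":
            parent = name.rpartition(".")[0]
            if parent and parent not in present:
                raise ValueError(f"{name} asserted before {parent}")
            present.add(name)
        else:  # retract removes the node and all of its substructure
            present = {n for n in present
                       if n != name and not n.startswith(name + ".")}
    return present

full = [("assert", "A"), ("assert", "A.X"),
        ("assert", "B"), ("assert", "B.Y"), ("retract", "B")]
reduced = {("assert", "A"), ("assert", "A.X")}
partial_order = {(("assert", "A"), ("assert", "A.X")),
                 (("assert", "B"), ("assert", "B.Y"))}

def consistent(order):
    """True if `order` respects every applicable pair of the partial order."""
    position = {t: i for i, t in enumerate(order)}
    return all(position[a] < position[b]
               for a, b in partial_order if a in position and b in position)

orders = [o for o in permutations(reduced) if consistent(o)]
results = [run(o) for o in orders]
```

Every entry of `results` equals `run(full)`, so the reduced set can be written down in any order and sorted by the partial order before execution.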


Whether a language is “declarative” depends 
upon what we know about the elements of that lan- 
guage and their sequencing. If we are absolutely sure 
of the appropriate sequences, the order of writing 
down the elements does not matter; we can re-sort
them into an appropriate order later. A transactional 
language is reducible to declarations if we can elimi- 
nate conflicts from the sequence of declarations so 
that order does not matter in the resulting reduced set. 


Statelessness 


While our task requires operations that are reduc- 
ible to declarations, this is not quite enough: 


Definition 4 A transaction or sequence of trans- 
actions p is idempotent if repeating p twice in 
succession has the same effect as executing p 
once. 


In other words, once p is successful, doing p again 
does nothing. More generally, 


Let L be a set of commands. L is stateless if for
any command $p \in L$ and any sequence
$q_1, \ldots, q_n \in L$, applying p followed by
$q_1, \ldots, q_n$ followed by p has the same effect as
the sequence $q_1, \ldots, q_n, p$. In other words, the
initial execution of p before the sequence does not matter.


Statelessness trivially implies idempotence. 
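
Both properties can be tested by brute force on small examples. The Python sketch below treats commands as functions from a configuration dictionary to a new dictionary and checks the two definitions over a few starting states and all intervening sequences up to a bounded length. It is a bounded check under these assumptions, not a proof, and the names `is_idempotent`, `is_stateless`, `set_a`, `set_b`, and `inc_a` are invented for the example.

```python
from itertools import product

def run(commands, state):
    """Apply commands left to right to a configuration dictionary."""
    for command in commands:
        state = command(state)
    return state

def is_idempotent(p, states):
    return all(run([p, p], s) == run([p], s) for s in states)

def is_stateless(language, states, max_len=2):
    """Check p q_1..q_n p == q_1..q_n p for all sequences up to max_len."""
    for p in language:
        for n in range(max_len + 1):
            for qs in product(language, repeat=n):
                for s in states:
                    if run([p, *qs, p], s) != run([*qs, p], s):
                        return False
    return True

set_a = lambda s: {**s, "A": 1}            # absolute assignment
set_b = lambda s: {**s, "B": 5}            # absolute assignment
inc_a = lambda s: {**s, "A": s["A"] + 1}   # relative (stateful) change

states = [{"A": 0, "B": 0}, {"A": 3, "B": 7}]
```

With these definitions, `is_stateless([set_a, set_b], states)` holds, while any language containing `inc_a` fails already at n = 0, which is exactly the failure of idempotence.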


The reason statelessness is important is that it is 
related to idempotence of sequences: 


Definition 5 A set of commands L is sequence-
idempotent if for any sequence of commands
$p_1, \ldots, p_n$ from L, applying the sequence twice
has the same effect as applying it once.


This is important because 


Proposition 1 If L is stateless, then L is
sequence-idempotent, i.e., the set of all
sequences of commands taken from L is idempotent.

The proof of this is contained in [16]. This fact allows
us to translate between descriptions of a configuration 
that are command-based and those that are instead 
declarative: 


Definition 6 A language L is reducible to con- 
vergence if it is reducible to declarations and the 
declarations, used upon the closure, are idempo- 
tent as a sequence. 


In other words, given any sequence $(t_1, \ldots, t_n)$, $t_i \in L$,
there is a subset $d_1, \ldots, d_k$ such that if $(e_1, \ldots, e_k)$ is an
ordering of $d_1, \ldots, d_k$ conforming to the partial order
P, applying the sequence $(e_1, \ldots, e_k)$ twice does nothing
different than applying it once. In particular,
applying this sequence to the configured closure does
nothing at all, while applying it to an unconfigured
closure reconstructs the closure’s current state.


Thus we have the immediate result that: 


Proposition 2 If L is both reducible to declarations
and stateless, then L is reducible to convergence.


This is a trivial corollary to Proposition 1. Note that 
statelessness is a sufficient but not necessarily 
required condition for sequence idempotence, so that 
it is a sufficient but not necessarily required condition 
for being reducible to convergence. 


Finally, we relate this to behavior of the overall 
closure: 


Proposition 3 Suppose that there is a one-to-one
correspondence between behaviors and configurations,
and that L is a set of transactions that
change configuration. Suppose that L is reducible
to convergence, $(t_1, \ldots, t_n)$ is a sequence of operations
in L reducible to the declarations
$\{d_1, \ldots, d_k\}$, and $(e_1, \ldots, e_k)$ is one order of
$d_1, \ldots, d_k$ compliant with the partial order P. Then
the operator $e_1 \cdots e_k$ formed by applying $e_1, \ldots, e_k$
in order is convergent in the sense of [4, 5, 6, 7],
i.e., it is idempotent as a sequence and interchangeable
with the sequence $t_1 \cdots t_n$ in assuring
the same behaviors.


Proof: Given $(t_1, \ldots, t_n)$, we know that L is reducible
to convergence, so that $(e_1, \ldots, e_k)$ exists by definition.
We also know that from a behavioral standpoint,
applying $e_1 \cdots e_k$ has the same behavioral effect as
applying $t_1 \cdots t_n$. Since there is a one-to-one map
between behavior and configuration, these also therefore
assert the exact same configuration. Since $e_1 \cdots e_k$
is idempotent, it will not change that configuration. □


This is a bit tricky, as one must require a corre- 
spondence between behavior and configuration. In 
cases where gratuitous differences exist between
configuration and behavior, we can still get into a state
where $e_1 \cdots e_k$ is not idempotent while $t_1 \cdots t_n$ is. For
example, consider the transactions:

$t_1 = \{A := \ldots\}$
$t_2 = \{B := \ldots\}$

and suppose that the value of A affects behavior but
the value of B does not. Due to this, $t_1 t_2$ is reducible to
$t_1$, but applying the sequence $t_1 t_2 t_1$ results in differing
behavior than applying $(t_1 t_2)$ alone. There are two
solutions to this: either disallow use of non-behavioral 
values in transactions [18], or limit one’s self to a defi- 
nition of configuration in which non-behavioral values 
do not appear. 


Impact of Statelessness and Reducibility 


The purport of the above mathematical discus- 
sion is that if the commands in a language are state- 
less, then reducibility to convergence implies repro- 
ducibility of effect, i.e., the reduction suffices as a 
declaration of state. There are several benefits of hav- 
ing a configuration language with these properties: 

1. At all times, it is possible to express the effect
of a sequence of commands in the same language
that the commands use themselves. This
eliminates the problem of “semantic distance”
[13] in which the language used to declare state
differs substantively from the language used to
assure it.

2. This effect is expressed in terms of positive and 
non-conflicting assertions. 

3. In a recovery situation, these assertions suffice 
as commands to reproduce current state. 


These are desirable properties for any language, 
but statelessness would seem a very strong condition 
upon a language. What kinds of languages have we 
eliminated from consideration? 


A stateless language is simply one in which all 
assertions are made with absolute (constant) values for 
parameters (or, at least, parameters that can be con- 
verted unambiguously to absolute form, such as rela- 
tive pathnames). A stateless language cannot allow 
incrementing or decrementing a configuration parame- 
ter, or base one parameter’s value upon that of another. 
This is a stateful (and non-idempotent) change by
nature (i.e., $pp \neq p$).


More subtly, the identity of the parameter that
gets set by an operation p cannot be a function of
some other setting. Suppose we have parameters A, B,
C, and that the operation p sets B to 1 if A is 0, and C
to 1 otherwise. Suppose that A is initially 0 and the
operation q is $A := 1$. Then $pqp$ has a different effect
($A=1$, $B=1$, $C=1$) than $qp$ ($A=1$, $B=0$, $C=1$),
violating statelessness.
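
This counterexample is small enough to execute directly. The Python sketch below encodes p and q as functions on a dictionary; the starting values B = 0 and C = 0 are an assumption implied by the stated outcomes rather than given explicitly in the text.

```python
def p(state):
    """Set B to 1 if A is 0, and C to 1 otherwise."""
    new = dict(state)
    if new["A"] == 0:
        new["B"] = 1
    else:
        new["C"] = 1
    return new

def q(state):
    """A := 1"""
    return {**state, "A": 1}

start = {"A": 0, "B": 0, "C": 0}   # B and C assumed to start at 0
pqp = p(q(p(start)))               # p, then q, then p
qp = p(q(start))                   # q, then p
```

The two results differ in B, so the first execution of p cannot be dropped: which parameter p writes depends on the state left by earlier commands.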

Stateless Transactions and XML 


It would seem that reducibility to assertions and 
statelessness impose rather extreme limits on what a 
command language can do. An immediate question 
about stateless transactions is whether one can create a 
set of representable (reducible and stateless) transac- 
tions that can maintain any kind of configuration file. 
Most configuration files are hierarchical in structure, 
and any reasonably consistent hierarchical structure 
can be expressed by an XML document type definition 
(DTD), so it suffices to show that one can correctly
maintain the contents of XML files via stateless and 
reducible transactions. 


The allowable forms of an XML document are 
described by its Document Type Definition (DTD). 
The DTD describes the allowable contents of each 
kind of XML node by a DTD rule. This rule expresses
the structure of allowable content for the node as a 
regular expression in which tokens are node labels. 
There are three constructions one can use to make a 
DTD rule: 

1. Sequencing: the expression “A,B,C” matches a
sequence of nodes: a node named A followed
by a node named B followed by a node named
C.

2. Alternation: “A|B|C” matches exactly one of A,
B, or C.

3. Repetition: “A*” matches zero or more copies
of a node named A in sequence. This is the
“Kleene star” operator.


More complex specifications are regular expressions
containing the above operators, e.g., one can say
that a node named A contains either a node named B or
a sequence of nodes named C followed by a node
named D by describing A’s content via the regular
expression pattern “B|(C*,D)”. This is described in a
DTD by the rule

<!ELEMENT A (B|(C*,D))>

This rule allows XML such as

<A>
<B>...</B>
</A>

and

<A>
<C>...</C>
<C>...</C>
<D>...</D>
</A>

but disallows XML such as

<A>
<C>...</C>
</A>

(because C cannot appear alone inside A in the above
rule). Here ... in the text represents content conforming
to (yet to be described) rules for the content of C and
D. Other DTD constructions, including + for “one or
more instances,” are expressible using these constructions:
A+ is just “A,A*”.
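
Because a DTD content model is a regular expression over node labels, the rule for A can be checked with an ordinary regex engine. The Python sketch below hand-translates the content model B|(C*,D) into a pattern over a comma-terminated list of child names. The translation and the name `conforms` are illustrative assumptions; a real DTD validator does considerably more.

```python
import re

# Content model of <!ELEMENT A (B|(C*,D))>, translated by hand:
# each child name is followed by a comma so that names cannot run together.
A_CONTENT = re.compile(r"(?:B,|(?:C,)*D,)")

def conforms(children):
    """True if the sequence of child node names matches A's content model."""
    encoded = "".join(name + "," for name in children)
    return A_CONTENT.fullmatch(encoded) is not None

print(conforms(["B"]))            # True
print(conforms(["C", "C", "D"]))  # True
print(conforms(["C"]))            # False
```

The last call fails for the reason given above: C may appear inside A only as a prefix of a sequence that ends in D.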

Without loss of generality and to ease notation, 
we do not consider element ATTRIBUTE declarations in 
XML. These are easily modeled as subordinate ele- 
ments of the element to which they apply. 

Our initial try at defining XML transactions will 
be based upon the XPATH language for identifying 
nodesets within XML files. Every XML file contains 
nodes that contain other nodes as content. In the file 

<foo>
<bar>
<goo>1</goo>
<cat>3</cat>
<goo>4</goo>
</bar>
</foo>


there are five nodes, including one foo, one bar, two 
goos, and one cat. A nodeset within an XML file is a 
set of nodes within the file having common attributes. 
XPATH is a language for identifying nodesets, using a 
notation similar to that used to identify files within a 
filesystem. For example, the XPATH /foo/bar refers to 
all nodes named bar within a top-level node named 
foo, while /foo/bar[2] refers to the second such node 
named bar in sequence. The special component * refers 
to a component with any name; /foo/*[5] refers to the 
fifth component of the content of foo with any name. 
While this simple subset of XPATH suffices for our 
discussion, there are many other options too numerous 
to cover here. All that we need to remember for now is 
that an XPATH determines a set of nodes within the 
document for which the assertion controls resulting 
content. This set of nodes is uniquely determined by 
the current content of an XML document and an 
XPATH, and may be empty. 


The general form of an XML transaction is: 
assert <nodeset> <content> 


where <nodeset> specifies a set of nodes in the file to 
transform (in XPATH notation) and <content> is XML 
content that should replace any existing content in the 
nodeset. <content> can be empty, in which case the 
assertion has the effect of retracting content from the 
node. The assertion succeeds if the requested transac- 
tion is possible (the nodeset defined by the XPATH 
<nodeset> is non-empty) and the resulting transformed 
XML document conforms to the document’s DTD; 
otherwise it fails and does nothing to the document at 
all. Because DTDs describe the content allowable for 
each node, and because each assertion provides that 
content, either all replacements are legal or all replace- 
ments fail, together. 


For example, in the file 


<foo>
<bar>
<goo>1</goo>
<cat>3</cat>
<goo>4</goo>
</bar>
</foo>


performing the command 
assert /foo/bar/goo[2] <ho>5</ho> 
would result in the document 


<foo>
<bar>
<goo>1</goo>
<cat>3</cat>
<goo><ho>5</ho></goo>
</bar>
</foo>


(provided that the resulting document is acceptable 
according to the DTD for the document). The XPATH 
/foo/bar/goo[2] matches the second instance of goo
inside an instance of bar inside an instance of foo. The 
XPATH /foo/bar/goo would match both instances, so 
that the assertion 
assert /foo/bar/goo <ho>5</ho> 
will produce the document 
<foo> 
<bar> 
<goo><ho>5</ho></goo> 
<cat>3</cat> 
<goo><ho>5</ho></goo> 
</bar> 
</foo> 
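
A minimal sketch of this all-or-nothing assertion can be written with Python's standard `xml.etree.ElementTree`, which supports the small XPATH subset used here (child steps and positional predicates). Several things are assumptions for illustration: the function name `assert_content`, the handling of only simple absolute paths, and the `is_valid` callback, which stands in for DTD validation because the standard library has no DTD validator.

```python
import copy
import xml.etree.ElementTree as ET

def assert_content(root, xpath, content, is_valid=lambda tree: True):
    """Replace the content of every node matched by `xpath`; all or nothing.

    Returns a new tree on success, or the original `root` object unchanged
    if the nodeset is empty or the result is rejected by `is_valid`.
    """
    steps = xpath.strip("/").split("/")
    if steps[0] != root.tag:
        return root                          # empty nodeset: assertion fails
    candidate = copy.deepcopy(root)          # work on a copy, never in place
    if len(steps) == 1:
        nodes = [candidate]
    else:
        nodes = candidate.findall("./" + "/".join(steps[1:]))
    if not nodes:
        return root
    for node in nodes:
        for child in list(node):
            node.remove(child)
        wrapper = ET.fromstring(f"<wrap>{content}</wrap>")  # fresh copy per node
        node.text = wrapper.text
        node.extend(list(wrapper))
    return candidate if is_valid(candidate) else root

doc = ET.fromstring(
    "<foo><bar><goo>1</goo><cat>3</cat><goo>4</goo></bar></foo>")
one = assert_content(doc, "/foo/bar/goo[2]", "<ho>5</ho>")
both = assert_content(doc, "/foo/bar/goo", "<ho>5</ho>")
```

Serializing `one` and `both` with `ET.tostring(..., encoding="unicode")` reproduces the two result documents shown above, and asserting the same content a second time leaves the document unchanged.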


Since one can re-assert the contents of the top-level
node (via assert /), every state of the XML file is reachable
via such assertions. Also, such assertions are idempotent;
either an assertion succeeds (if its total effect
conforms to the document’s DTD) or it does not succeed
and does nothing to the document. Whether or not it
succeeds, repeating it immediately has the exact same
effect (because the whole assertion is variable-free).


It is a bit more difficult to see that 


Proposition 4 The set of all possible assertions 
A of the form 

assert <nodeset> <content> 

is stateless and reducible to assertions. 


Proof: Consider a transaction $T \in A$ and a sequence S
of transactions in A. Note that all content of T is constant;
there are no variables that can change state
within the transaction itself. Note also that no transaction
in A can change the number of elements in the
content of an element unless it also asserts all of the
content of that element.

Consider what happens when one applies TST in
sequence. Either the nodeset determined by the
XPATH in T changes or it stays the same. If it stays
the same, then T has the same effect by definition, so
that the first T need not be executed and T is stateless.
If the nodeset changes, however, it must have changed
as a result of assertions that change the whole content
of nodes. These assertions must override the prior values
of T whenever they affect its nodeset. So in either
case, T is stateless. As T and S were arbitrary, the
whole language A is stateless.

A is trivially reducible to assertions, as it has no
retract statements at all! □


Statelessness and Semantics 


So far, we have a very awkward system for edit- 
ing XML. It would be convenient to add two new 
primitives 

add <nodeset> <content>

and

subtract <nodeset> <content>


that augment and remove sequenced content from a 
node. add puts new content into a sequence, while sub- 
tract removes matching content from a sequence. The 
success of both operations is again dependent upon con- 
formance between the result and the document’s DTD. 


Without further constraints, we no longer have 
statelessness. To have statelessness, we must also have 
idempotence, but the add primitive is not even idem- 
potent; adding something twice results in two entries 
for the item instead of one. For example, consider the 
XML document 

<foo>
<bar>
<goo>10</goo>
<goo>16</goo>
</bar>
</foo>

and the effect of two statements:

add /foo/bar <goo>20</goo>
add /foo/bar <goo>20</goo>

After these statements (and with no further constraints)
the resulting document would contain:

<foo>
<bar>
<goo>10</goo>
<goo>16</goo>
<goo>20</goo>
<goo>20</goo>
</bar>
</foo>

instead of

<foo>
<bar>
<goo>10</goo>
<goo>16</goo>
<goo>20</goo>
</bar>
</foo>


(which would be the required result for a stateless add 
operation). 


The DTD for an XML document is syntactic; it 
describes what can be written but not what the written 
text means. What is missing from our model is a 
notion of semantics that would tell us when two con- 
figurations are equivalent. A model of semantics, in 
turn, allows one to understand what add and subtract 
should do in order to remain stateless. 


Note that idempotence and statelessness are not 
properties of what operations do, but of what the 
results mean. If operations act on a configuration file 
to produce the same contents, it is rather obvious that 
they result in the same behavior. However, sequences 
of operations that produce differing configuration files 
may produce the same behavior, e.g., if the differences 
in configuration do not produce differences in behavior. 
In adding information to a sequence, one must
ask several questions. Does order of the sequence mat- 
ter or not? Do duplicates matter, or does the first or last 
instance of a duplicate override the others? What con- 
stitutes a duplicate entry? These are semantic questions 
that go beyond the simple syntax described by a DTD. 


Preserving statelessness in using substructure 
addition and subtraction is a matter of both under- 
standing semantics and limiting operations to fit. If 
members of a sequence are pure declarations, so that 
order does not matter and duplicates are ignored, then 
add and subtract should behave accordingly. Trying to 
add a duplicate should have no effect. 


More subtle, the semantic definition of a dupli- 
cate often involves the notion of a unique key. For 
example, suppose that in Apache we are defining 
access rights to directories. Obviously, the directory to 
which we are defining access constitutes a unique key; 
we should not be able to define two different ideas of 
protection for one directory. 


This means that we need several context-sensi- 
tive notions of what add and subtract should do. 

1. If order of assertions matters and cannot be
inferred from assertion content, the game is
over. The language is imperative and statelessness
is impossible.

2. If order of assertions does not matter and dupli- 
cates can occur, then add cannot be stateless, 
because it is not even idempotent. 

3. If order of assertions does not matter and dupli- 
cates should not occur, the add command 
should have the form 


add <nodeset> <key> <content> 


where <nodeset> determines a set of nodes on 
which to operate, <key> is an XPATH relative to 
each node that determines a key that should be 
unique, and <content> is content to add. For 
example, given the XML 


<foo>
<bar>
<goo>10</goo>
<goo>16</goo>
</bar>
</foo>

the command

add /foo/bar goo <goo>10</goo>

does nothing at all, because 10 is already a key,
while

add /foo/bar goo <goo>20</goo>

results in the document

<foo>
<bar>
<goo>10</goo>
<goo>16</goo>
<goo>20</goo>
</bar>
</foo>

The extra goo in the assertion indicates that the
content of goo is the key that should be unique.
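
The keyed form of add can be sketched with Python's `xml.etree.ElementTree`. This is a narrow illustration under stated assumptions: the key is taken to be the text of the child elements selected by the key path, paths are written relative to the root element (so /foo/bar becomes ./bar), DTD conformance is not checked, and the name `add_keyed` is invented for the example.

```python
import xml.etree.ElementTree as ET

def add_keyed(root, node_path, key, content):
    """Append `content` to each matched node unless its key value already exists."""
    for node in root.findall(node_path):
        new_child = ET.fromstring(content)
        existing = {child.text for child in node.findall(key)}
        if new_child.text not in existing:   # duplicates are ignored
            node.append(new_child)
    return root

doc = ET.fromstring("<foo><bar><goo>10</goo><goo>16</goo></bar></foo>")
add_keyed(doc, "./bar", "goo", "<goo>10</goo>")   # key 10 exists: no change
add_keyed(doc, "./bar", "goo", "<goo>20</goo>")   # adds a third goo
add_keyed(doc, "./bar", "goo", "<goo>20</goo>")   # repeated: no change
values = [g.text for g in doc.findall("./bar/goo")]
```

After the three calls, `values` holds 10, 16, and 20 exactly once each, so repeating an add has no further effect, unlike the unkeyed add shown earlier.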


The moral of this section is that unless the XML 
being created is truly a declaration (i.e., order does not 
matter and duplicates should not exist), statelessness is 
impossible in the transactional language that updates it. 


Application to Configuration Management 


The impact of this mathematical theory upon the 
general problem of configuration management is sub- 
tle but inescapable. So far, we have largely ignored the 
problem of change management in host configuration 
management. Generative tools (that create an entire 
configuration from templates) largely avoid the issue 
of change management by erasing everything and 
starting over each time. However, these tools are rela- 
tively limited in scope, as they cannot handle major 
changes such as package management: installation and 
removal of software subsystems. Convergent tools 
allow one to become sloppy and forget that a compo- 
nent is managed, while imperative tools deal poorly 
with undoing changes. 


In the sub-problem of managing the configura- 
tion of a web server, change management is a central 
concern, so that we must adjust our practice and lan- 
guage to ease that task. The result, however, is that we 
created a framework for change management that 
applies to the more general problem of network con- 
figuration management. In fact, we have created an 
“assembly language” that is the lowest level of a new
strategy for configuration management. 


At its core, every configuration can be described 
in terms of a set of assertions that are true at a given 
time. In the simplest case, each assertion assigns val- 
ues to one or more “configuration parameters.” Since 
configuration values are always specified as absolute 
quantities in assertions, such assertions are naturally 
stateless. Most configuration management tools are 
driven by configuration files containing only stateless 
assertions of this kind. 


By viewing the assertions in such a configuration 
file as commands to be executed, the only thing we 
have added in our model is a concept of retraction of 
assertions. There is a constraint model that describes 
which assertions conflict with which others, and a rule 
that keeps current values consistent with one another, 
e.g., for our web server, we know that asserting a new 
index file for a directory is going to override the old 
index declaration. If we retract a virtual server, then all 
parameters for that server no longer exist. 


If the command language is reducible to asser- 
tions, no matter what incremental changes we make to 
the overall configuration through further assertions, 
the results remain precisely expressible as a set of 
assertions. Further, the assertions are sequence-idem- 
potent, so that repeating them has the same effect as 
doing them once. Thus the list of valid assertions is a 
good substitute for the policy file found in many
configuration management systems.

Figure 5: A new model of configuration management,
in which low-level statements exhibit statelessness
while upper levels encapsulate stateful behavior.

This gives rise to a new model of configuration
management (see Figure 5). At the core, a transaction
engine interprets a stateless language. This engine
interprets stateless commands to effect requested
changes in overall configuration. This engine is
responsible for maintaining the list of valid assertions
and retracting conflicting assertions. This engine deals
with stateless commands only.


At the next level, stateful commands are translated
into stateless commands for ease of use. While
the user thinks in relative terms (“more space”), the
transactional system must think in terms of absolute
requirements (“2 MB”). A simple memory mechanism
here makes the translation between relative and
absolute units.

At the third level, meta-commands describing
overall intent are translated into the assertions that
create that intent. “Become a web server” is translated
into the various assertions that cause that to happen.
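
The second level can be pictured as a thin translator that remembers current values and emits only absolute assertions. The Python sketch below is one hypothetical shape for such a layer; the class name `Translator`, the method `more`, and the parameter name `quota_mb` are all invented for illustration and do not come from the prototype.

```python
class Translator:
    """Turns relative requests into absolute assertions by remembering values."""

    def __init__(self, initial):
        self.current = dict(initial)   # the "simple memory mechanism"

    def more(self, parameter, amount):
        """Translate 'more of X' into a stateless assertion with an absolute value."""
        self.current[parameter] += amount
        return ("assert", parameter, self.current[parameter])

translator = Translator({"quota_mb": 1})
command = translator.more("quota_mb", 1)   # user asks for "more space"
```

Here `command` is `("assert", "quota_mb", 2)`: the state lives in the translator, while the transaction engine below it sees only a constant value that is safe to repeat.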


This rather strange way of accomplishing configu- 
ration management has a few rather obvious advantages: 

1. The whole history of the configuration of the 
machine is contained in one transaction stream. 

2. At any time, there is a deterministic procedure 
that can determine what configuration is in 
effect at that time. 

3. Storing the stream and its changes allows one 
to roll back time, by replaying the stream or 
selectively retracting the newest assertions, 
backwards. 

4. One can specify changes as incremental operations
upon a pre-existing structure.

5. Changes can come from multiple sources (e.g.,
different administrators or groups) and will be
disambiguated at implementation time.

6. At any time, there is a coherent picture of the
assertions currently in effect.

Figure 6: Parts of a complete HTTP closure, where the dotted box indicates completed prototypes.
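
Several of the advantages listed above follow from treating the transaction stream as the single source of truth. The Python sketch below replays a stream of absolute assertions and retractions to compute the configuration in effect at any point in its history. The tuple encoding, the function name `replay`, and the parameter names (borrowed from Apache directive names purely as labels) are assumptions for illustration.

```python
def replay(stream, upto=None):
    """Deterministically compute the configuration after the first `upto` transactions."""
    config = {}
    for verb, key, value in stream[:upto]:
        if verb == "assert":
            config[key] = value        # a new assertion overrides the old value
        else:                          # retract
            config.pop(key, None)
    return config

stream = [
    ("assert", "DocumentRoot", "/srv/www"),
    ("assert", "DirectoryIndex", "index.html"),
    ("retract", "DirectoryIndex", None),
    ("assert", "DirectoryIndex", "home.html"),
]

now = replay(stream)          # configuration currently in effect
earlier = replay(stream, 2)   # roll back time to after the second transaction
```

Rolling back is then a matter of replaying a shorter prefix of the stream, and the result of `replay` at any point is itself a set of positive assertions.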


These observations are not true in general of 
current convergent administrative tools, including 
Cfengine. In Cfengine, changes are specified by 
editing a monolithic file. There is no easy way to 
undo a configuration step. It is difficult for multiple 
people to collaborate on a single configuration without 
conflicts. In order to implement such a mechanism, 
Cfengine would have to have the ability to return a file 
to the state before any edits have been applied. 


This proposal needs much study before we 
implement such a language, but is clearly implied by 
our study of HTTP. 


Conclusions 


Work on this project has been a long road of dis- 
covery. Ad-hoc creation of a closure language — based 
upon a configuration language — led to much initial 
confusion. The exact things that make a configuration 
file easy to read make a command language confusing 
and difficult to use. Disambiguating that language 
required a mathematical approach: stateless com- 
mands removed the quandaries posed by stateful syn- 
tax. The result proves that for a limited problem 
domain, closures exist. 


But we are a very long way from coming to a com- 
plete closure. A complete solution seems to have more 
parts than we could have imagined initially (Figure 6).

1. To assure idempotence of operations, we need
content staging, and an independent repository 
for staged data. There should be two content 
hierarchies, one for actual provision and the 
other a cached copy that allows restoring the 
provided data, e.g., after a crash. 

2. It is impossible to deal with dynamic content 
written into the web hierarchy by traditional 
means. This must be handled by some kind of 
dynamic storage closure (that may be a data- 
base, or perhaps something else). 

3. We had much difficulty verifying that a 
declared virtual server would work properly 
according to information in DNS and DHCP. 
We need the ability to converse with and nego- 
tiate with DNS and DHCP closures in order to 
determine whether declared virtual servers will 
work properly. 


As well, the actual configuration closure should
have several parts that are not currently present
(Figure 7). A module management subsystem should allow
dynamic selection of modules, while a constraint
engine disallows impractical choices. Likewise, a
virtual server management subsystem should disallow
virtual server configurations that cannot work, e.g.,
declaring more than one SSL server for a single IP address.


We also acknowledge that the end product may
not be a single HTTP service closure, but several
differing ones for different applications. Making the
language simple enough to use in one application may
preclude its use in another. For example, a closure
whose language is simple enough for use by untrained
people may not be expressive enough to be appropriate
for experts.


This is only the beginning of our journey, and there 
are several caveats for those who would follow this path. 
1. First, beware of seemingly stateless languages 
whose environment is stateful. In our language, 
the act of copying files is stateless, but the files 
themselves can change between copies. So the
copying commands are not truly idempotent.

2. Second, beware of implementing the top level 
of a closure before the lower levels. We cannot 
really assure behavior, because our closure is 
not built on top of a foundation of lower-level 
closures. Assurance and trust must rise from the 
bottom levels rather than being imposed from 
above. This is the obvious way to settle quan- 
daries such as how to validate that virtual 
servers will receive requests, etc. 

3. Third, beware of making a closure that works 
around limitations that should rightly be there 
with good reason. We chose not to allow CGI 
scripts to create dynamic content within the clo- 
sure. This seems a limitation, but actually 
reflects best practices for web content creation. 
If CGIs “should” be using databases, why
should we allow them not to? Ideally these
CGIs should be communicating with “data
persistence closures,” otherwise known as
database management systems!

4. Creating a closure that reacts predictably to 
configuration commands requires discipline in 
creating the command set. But many more dis- 
ciplines are required, and some features to 
which we are accustomed in existing paradigms 
— like CGI scripts editing server files — must 
cease in order for the closure to become reliable. 
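
The first caveat above can be illustrated with a minimal Python sketch. The `copy_file` helper and the file names are hypothetical stand-ins, not part of the closure language described in this paper; the sketch only shows how a command that looks stateless stops being idempotent once its source file changes between invocations.

```python
import os
import shutil
import tempfile


def copy_file(src: str, dst: str) -> None:
    """A 'stateless' command: its text names only a source and a target."""
    shutil.copyfile(src, dst)


def read(path: str) -> str:
    with open(path) as handle:
        return handle.read()


workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "index.html.staged")   # hypothetical staged copy
dst = os.path.join(workdir, "index.html")          # hypothetical served copy

with open(src, "w") as handle:
    handle.write("version 1")

copy_file(src, dst)
first = read(dst)
copy_file(src, dst)            # repeating with an unchanged source is harmless
second = read(dst)

with open(src, "w") as handle: # the environment changes between copies
    handle.write("version 2")
copy_file(src, dst)            # same command text, different result
third = read(dst)

shutil.rmtree(workdir)
```

Repeating the command is only safe while the source stays fixed, which is why staged content held in an independent repository matters for idempotence.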


In the final estimation, it remains unclear 
whether closures are the solution to the complexity of 
configuration management, and unclear whether rea- 
sonable stateless languages exist for other applica- 
tions. The most important lesson of this study is that 
practice must adapt to allow closure to exist. Without 
a fundamental change in language, the HTTP closure 
would have been impossible to create. 


Ours was not just a journey of software design, 
but also of evolving thinking. The future of that think- 
ing remains unknown. It is likely that as we try to cre- 
ate practical closures, more radical shifts in thinking 
and practice will be required. The answer seems to lie 
in the simple statement that language must carefully 
conform to needs. The need for simple subsystems 
that are easily managed will no doubt lead to “paths 
where no one thought.”


ABSTRACT

Managing information flow between different parts of the enterprise information
infrastructure can be a daunting task. We have grown too large to send the complete lists around
anymore; instead, we need to send just the changes of interest to the systems that want them. In
addition, we wanted to eliminate “sneaker net” and have the systems communicate directly
without human intervention. Some of our applications required real time updates, and for all cases,
we needed to respect the “business rules” of the destination systems when entering information.
This paper describes a general method for propagating changes of information while respecting the
needs of the target systems.


Introduction 


At LISA 2002, I presented a paper Embracing 
and Extending Windows 2000 [4] that described how 
we kept our Windows 2000 environment, as well as 
our LDAP directory services synchronised with our 
Unix account space. These feeds quickly grew to carry 
more than just Unix account information to include 
directory and other status information. Well, we were 
a victim of our own success. Other systems needed 
access to the same or similar change feeds, and other 
data streams were becoming available, and a more 
general architecture was needed. In addition, we found 
that we had to interface with vendor supplied systems 
and it became important to provide a clear demarca- 
tion between our systems and the vendor’s systems 
and provide a clear place to implement their business 
rules with our data. 


At LISA 96 in Chicago, I gave an invited talk 
Manage People, not Userids that demonstrated the 
importance of managing the more general information 
about people, and from that, managing their computer 
accounts. In addition, in a paper at the same
conference (White Pages as a Problem in Systems
Administration [3]), I again showed how tools for systems
administration could benefit other areas and that many
areas for code and tool re-use exist. As our friends in
the Java community (and other object-oriented
languages) are fond of telling us, solve the problem once,
and re-use the solution to solve other problems. Thus,
we wanted a general mechanism to move different
types of changes to different systems.


At our site, many of our systems¹ are vendor-supplied
packages running on an Oracle or other relational
database. In addition, we were also feeding information
to non-relational systems such as our LDAP
directory servers and the Windows 2000 domain
controllers. To further complicate matters, we have many
different data elements available, and not all systems
wanted all data elements, so we needed ways to pick and
choose which data elements went to which system. We
also needed to be able to accommodate different
operating schedules and data latency requirements. Some data
elements change very slowly (such as adding a new
building), where a daily update feed is more than
adequate, while other data elements need to move much
faster (such as a password change or email forwarding).
We wanted to retain the low processing costs we
achieved in earlier implementations, while making it
easier to add new “listeners” to a feed. Lastly, although
we wanted changes to propagate quickly, we needed to
avoid blocking an operation on one system because a
downstream system was not reachable.

¹ Student Records, Human Resources, ID Card, Dining Services,
Space Management, Telephone Billing, Help Desk, etc.


Interfaces and Business Rules 


The first aspect of this project is the interface
model we use to actually get the changes into the
destination system. While many applications have
procedures to import a CSV file, these require manual
activity and our objective is to fully automate the process.
Some applications and systems provide an API that
we can call to insert and update records; this is our
preferred method. But other systems don’t provide that,
and at least for database-backed systems, we need to
muck about directly in the vendor database tables. We
wanted a clear demarcation between our systems and
the interface code that needs to understand how the
target system works. For the systems without an API,
our approach is to insert the changed records into an
import table and have that trigger the appropriate
processing. We have used this model as well as the API
model successfully.


Assuming that we have some sort of interface, we
still need to face the classic system admin issue of
pushing in changes from the central server versus
pulling in changes from the client. The answer here is
“it depends.” In general, I have taken a very pragmatic
approach. For destination systems that do not have
“aggressive administration,”² I prefer the push model
from the central server. This allows me to monitor the
connections and updates and become aware of
problems (and hopefully resolve them) before the end
users do. This also allows me to adjust the schedule and
timing of updates as needed. For systems with
aggressive administration, we can negotiate the “best”
approach (most efficient, least work, etc.).


Procedural API 


At the heart of the Meta Change Queue package is 
the Get_Changes routine (Figure 1) which provides all 
of the changes for the specified listener in order. This is 
called with the processor (queue) name, and an optional 
table name within that queue. This will return a record 
with a number of fields of interest (Table 1). When the 
record has been processed, the Ack_Change routine is 
called. This cycle is repeated until the Change_Type 
field in the record is null. This indicates that there are 
no more changes that need to be processed. 


function Get_Changes (
    Proc_Name in varchar2,
    Tname     in varchar2)
  return Rec;

procedure Ack_Change (R in Rec);
Figure 1: Get_Changes definition.
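
The get/acknowledge cycle described above can be sketched in a few lines of Python. This is a minimal in-memory model under stated assumptions, not the PL/SQL package itself: the `ChangeQueue` class, the `drain` helper, and the reduced set of record fields are hypothetical names chosen for illustration. Only the shape of the loop (fetch, process, acknowledge, stop on a null Change_Type) comes from the text.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class ChangeRecord:
    tname: Optional[str] = None        # table that changed
    change_type: Optional[str] = None  # None signals "no more changes"
    proc_name: Optional[str] = None    # listener (queue) name
    person_id: Optional[int] = None


class ChangeQueue:
    """Hypothetical in-memory stand-in for the queue behind Get_Changes."""

    def __init__(self) -> None:
        self._pending: List[ChangeRecord] = []

    def log(self, rec: ChangeRecord) -> None:
        self._pending.append(rec)

    def get_changes(self, proc_name: str,
                    tname: Optional[str] = None) -> ChangeRecord:
        # Return the oldest pending change for this listener, in order.
        for rec in self._pending:
            if rec.proc_name == proc_name and tname in (None, rec.tname):
                return rec
        return ChangeRecord()          # empty record: queue is drained

    def ack_change(self, rec: ChangeRecord) -> None:
        self._pending.remove(rec)


def drain(queue: ChangeQueue, proc_name: str,
          handler: Callable[[ChangeRecord], None]) -> int:
    """Process every pending change for one listener; return the count."""
    processed = 0
    while True:
        rec = queue.get_changes(proc_name)
        if rec.change_type is None:
            break
        handler(rec)
        queue.ack_change(rec)          # acknowledge only after processing
        processed += 1
    return processed


queue = ChangeQueue()
queue.log(ChangeRecord("PERSON", "Update", "LDAP", 1))
queue.log(ChangeRecord("PERSON", "Update", "Insite", 3))
queue.log(ChangeRecord("LOGINS", "Insert", "LDAP", 2))

seen: List[int] = []
count = drain(queue, "LDAP", lambda rec: seen.append(rec.person_id))
```

Acknowledging only after the handler returns means a change that fails part-way stays in the queue and is offered again on the next call.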


² An administrative team who is constantly monitoring the
system and is able and willing to set up cron jobs or the
equivalent.

When an application is processing a change, it
examines the change record and, based on the Tname
and subtype (and other fields), determines what record
changed and gets the current value of that record
from the database. This is a very important point to
understand: we do not record what the change was,
only that something changed. We need to be able
to move the final state for a record without having to
step through intermediate steps. If I change my phone
number twice, the only thing that matters is the final
number. Other aspects of our systems may maintain
history and change logs, but not this one. Here we
only indicate that something changed. The application
must be able to apply the same change twice without
harm, i.e., “set quota to 100” is OK to repeat,
“increase quota by 50” is not.
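
A small Python sketch may make the two points above concrete: the queue carries only the identity of what changed, and the listener copies the current value, so repeated or stale entries are harmless. All names here (`source_db`, `target`, `change_phone`, `apply_changes`, and the quota helpers) are hypothetical and exist only for this illustration.

```python
from typing import Dict, List

source_db: Dict[int, str] = {42: "555-0100"}  # current phone number by person
target: Dict[int, str] = {}                   # downstream copy
queue: List[int] = []                         # records *who* changed, not *what*


def change_phone(person_id: int, number: str) -> None:
    source_db[person_id] = number
    queue.append(person_id)                   # note the change, not its content


def apply_changes() -> int:
    applied = 0
    while queue:
        person_id = queue.pop(0)
        # Copy the current state; intermediate values are never replayed.
        target[person_id] = source_db[person_id]
        applied += 1
    return applied


change_phone(42, "555-0111")
change_phone(42, "555-0122")
applied = apply_changes()                     # two entries, one final state

# "set" style updates can be repeated safely; "increase" style cannot.
quota_set = {"alice": 50}
quota_inc = {"alice": 50}
for _ in range(2):                            # the same change delivered twice
    quota_set["alice"] = 100
    quota_inc["alice"] += 50
```

After the duplicate delivery, the "set" style quota is still 100, while the "increase" style quota has drifted to 150.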


There is another set of routines that, given a
change record, will return the desired information
(directory, status, etc.) to applications that can then
update the target system. This model has worked well
with our interfaces to LDAP and Active Directory,
where we have written a program in Java or C# that
gets the queued changes and updates the target. These
applications apply all the changes in the queue,
acknowledging them as they go. Once an application
reaches the end of the queue, it sleeps for a short time
and looks for changes again. These programs will retry
if they lose the network connection and will eventually
catch up once they can reconnect. This automatic
restart has proven very handy and reliable.
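
The retry behavior described above might look roughly like the following Python sketch. It is an illustration under assumptions, not the Java or C# listeners themselves: `run_listener`, `TargetUnavailable`, and the callback parameters are hypothetical names. The key property is that a change is acknowledged only after it has been applied, so anything interrupted by a lost connection is simply picked up on a later cycle.

```python
import time
from typing import Callable, List


class TargetUnavailable(Exception):
    """Raised by the (hypothetical) target client when the connection drops."""


def run_listener(fetch_pending: Callable[[], List[int]],
                 apply_change: Callable[[int], None],
                 ack: Callable[[int], None],
                 max_cycles: int,
                 sleep_seconds: float = 1.0,
                 sleep: Callable[[float], None] = time.sleep) -> int:
    """Poll, apply, acknowledge; on failure leave the rest queued and retry."""
    applied = 0
    for _ in range(max_cycles):
        try:
            for change in fetch_pending():
                apply_change(change)
                ack(change)            # acknowledged only once applied
                applied += 1
        except TargetUnavailable:
            pass                       # unacknowledged changes stay queued
        sleep(sleep_seconds)           # short nap before looking again
    return applied


pending: List[int] = [1, 2, 3]
delivered: List[int] = []
failures = {"remaining": 1}            # simulate one dropped connection


def apply_change(change: int) -> None:
    if change == 2 and failures["remaining"] > 0:
        failures["remaining"] -= 1
        raise TargetUnavailable("connection lost")
    delivered.append(change)


applied = run_listener(lambda: list(pending), apply_change, pending.remove,
                       max_cycles=2, sleep=lambda seconds: None)
```

In the simulated run, the first cycle delivers change 1 and fails on change 2; the second cycle catches up with changes 2 and 3.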


The Get_Changes interface also provides a handy
hook for our process monitoring system [5]. The
applications that are polling via the Get_Changes
routine often just sleep for a short time, maybe a second.
Unlike calls to Get_Changes, which put a very
small load on the database, calls to the Mark_Process
routine result in a write (or update) to the database,
and frequent calls will impact performance and
transaction logs. So we typically wrap the call to
Mark_Process in code that skips the actual call until at
least five minutes have elapsed since the last call. This
will still give us good notification when one of these
processes dies. We usually catch one that has died
every three or four months.
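
The wrapper around Mark_Process can be sketched as a small throttle. The class below is a hypothetical Python illustration (the name `ThrottledHeartbeat` and the injectable clock are assumptions made so the example can run without waiting); only the five-minute interval comes from the text.

```python
import time
from typing import Callable, List, Optional


class ThrottledHeartbeat:
    """Forward at most one heartbeat per `min_interval` seconds."""

    def __init__(self, mark_process: Callable[[], None],
                 min_interval: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._mark_process = mark_process
        self._min_interval = min_interval
        self._clock = clock
        self._last: Optional[float] = None

    def beat(self) -> bool:
        """Call on every polling cycle; returns True if a write was made."""
        now = self._clock()
        if self._last is not None and now - self._last < self._min_interval:
            return False               # too soon: skip the database write
        self._mark_process()
        self._last = now
        return True


# Simulated polling cycles at the given times (seconds).
fake_now = [0.0]
writes: List[float] = []
heartbeat = ThrottledHeartbeat(lambda: writes.append(fake_now[0]),
                               clock=lambda: fake_now[0])

results = []
for t in (0.0, 1.0, 299.0, 300.0, 301.0):
    fake_now[0] = t
    results.append(heartbeat.beat())
```

The polling loop can call `beat()` every second, yet the monitoring table is written only at 0 and 300 seconds in this run.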


Import Table 
Our second interface method uses an import table
to receive records. When a record is inserted into the
import table, a database trigger³ fires, which then
processes whatever business rules are required. This
appears as the bottom row of elements in Figure 2. We
have used this successfully with several different
vendor applications.⁴ In cooperation with the vendor
engineers, we define an import table, and then the
vendor engineer writes a database trigger that processes
each insertion as it happens and makes the appropriate
changes in their own tables. This allows us to feed in
the changes in a controlled manner and isolates our
code from vendor changes. The vendor does need to
update their triggers when they make a change. We had
originally intended to queue the records in the import
table, and then the vendor would have a process that
looks for pending records (much like how we did the
Meta Change Queue project), but we found it easier to
just write the trigger and avoid writing the polling
application.

³ A database trigger is a stored procedure in the database
that will be executed whenever there is an insert, update or
delete on a row in a database table [1].

⁴ BEST (ID card and access control), FAMIS (physical plant
trouble ticketing), INSITE (space management).

Field        Type           Description
Tname        varchar2(32)   The name of the table that had the change.
Change_Type  varchar2(8)    One of “Insert,” “Delete,” or “Update.”
rrowid       rowid          Oracle row identifier of base table record.
proc_name    varchar2(32)   The processor (or queue) name.
subtype      varchar2(32)   More detailed information about what
                            specifically changed about the target object.
person_id    number         The internal person identifier if the object
                            is defined in the “people” table.
Pkey_String  varchar2(32)   The primary key (identifier) of the object
                            (if not a person) as a character string.
pkey_number  number         The primary key of the object when that is a
                            numeric value (not a character string).
aux_string   varchar2(255)  An optional extra character field to identify
                            the change. Often used for membership changes
                            where two keys are needed.
entry_date   date           The time and date this change was made.
Table 1: Change record definition.

Figure 2: Interface and business rules.


One example of this is seen in Figure 3, which is
a database trigger written by a vendor engineer. In this
case, for each new entry in the Simon_Person_Import
table, it first checks to see if the entry has already been
made in the vendor table SA_PERSON, and if not, it
inserts the person. If the person is already in the table,
it checks to see if the person has a status.⁵ If they don’t
have a status, it checks whether they previously had
one, and if so, changes it to “Former-” whatever, and
then updates the person’s record. The vendor
application did not have a field for the “Status” of a
person, and although we could have added one to their
table (like we did with the Person_Number), we did not
want to change all of the display screens, so we took
over the Person_Location field, and store and display
the status there.

⁵ Maintaining “status” values for every person is a topic for
another paper.

CREATE OR REPLACE TRIGGER T_SIMON_PERSON_IMPORT
BEFORE INSERT ON SIMON_PERSON_IMPORT FOR EACH ROW
declare
  Cursor Get_Rec (pn number) is
    Select person_number, person_last_name, person_first_name,
           person_text1, person_location, person_memo, rowid
      from SA_Person where person_number = pn;
  R          Get_Rec%RowType;
  new_status varchar2(48);
begin
  Open Get_Rec(:new.spriden_id);
  Fetch Get_Rec into R;
  if Get_Rec%NotFound
  then                            -- No existing record, insert a new one
    new_status := nvl(:new.status,'No Status');
    INSERT INTO SA_PERSON
      (PERSON_ID, ENTERED_DATE, ENTERED_BY, PERSON_NUMBER, PERSON_LAST_NAME,
       PERSON_FIRST_NAME, PERSON_TEXT1, PERSON_LOCATION, PERSON_MEMO)
    VALUES
      (PERSON_ID_SEQ.NEXTVAL, SYSDATE, USER, :new.spriden_id, :new.lastname,
       :new.firstname, orgn, substr(new_status,1,24), :new.title);
  else
    if :new.status is null        -- If no current status,
    then                          -- save what they were
      if substr(R.person_location,1,7) = 'Former-'
      then new_status := r.person_location;
      else new_status := 'Former-' || R.Person_Location;
      end if;
    else
      new_status := :new.status;
    end if;
    Update SA_Person
       set Updated_Date = sysdate, Updated_By = user,
           Person_Last_Name = :new.lastname,
           Person_First_Name = :new.firstname,
           Person_Location = substr(new_status,1,24),
           person_text1 = orgn, person_memo = :new.title
     where rowid = r.rowid;
  end if;
end;
Figure 3: Insite Trigger.



We don’t actually care about the contents of the 
Simon_Person_Import table. Once the trigger fires and 
completes, all of the work is done. We periodically 
flush the import table. If there is a problem with the 
trigger, perhaps some integrity constraint (unique user- 
names, etc.) is violated, the trigger throws an excep- 
tion and the insert fails. This exception propagates 
back to the system attempting the insert and appropri- 
ate error handling can take place there. 
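
The import-table pattern can be tried out without an Oracle instance. The following Python sketch uses SQLite as a stand-in, so the trigger syntax differs from the PL/SQL in Figure 3; the table names `person_import` and `vendor_person`, the `push_person` helper, and the "last name is required" rule are assumptions made for illustration. It shows the three behaviors described above: the trigger does the real work, repeating a load is harmless, and a rule violation surfaces as an exception at the caller.

```python
import sqlite3

SCHEMA = """
CREATE TABLE vendor_person (
    person_number INTEGER PRIMARY KEY,
    last_name     TEXT NOT NULL,
    status        TEXT NOT NULL
);
CREATE TABLE person_import (
    person_number INTEGER,
    last_name     TEXT,
    status        TEXT
);
CREATE TRIGGER t_person_import BEFORE INSERT ON person_import
FOR EACH ROW
BEGIN
    -- Business rule: reject the insert outright if it cannot be applied.
    SELECT CASE WHEN NEW.last_name IS NULL
                THEN RAISE(ABORT, 'last_name is required') END;
    -- Existing person: refresh, keeping a "Former-" status if none is given.
    UPDATE vendor_person
       SET last_name = NEW.last_name,
           status = CASE
               WHEN NEW.status IS NOT NULL THEN NEW.status
               WHEN substr(status, 1, 7) = 'Former-' THEN status
               ELSE 'Former-' || status
           END
     WHERE person_number = NEW.person_number;
    -- New person: insert only if the update above found nothing.
    INSERT INTO vendor_person (person_number, last_name, status)
    SELECT NEW.person_number, NEW.last_name,
           coalesce(NEW.status, 'No Status')
     WHERE NOT EXISTS (SELECT 1 FROM vendor_person
                        WHERE person_number = NEW.person_number);
END;
"""


def push_person(conn, person_number, last_name, status) -> bool:
    """Insert into the import table; report whether the trigger accepted it."""
    try:
        with conn:                     # commit on success, roll back on error
            conn.execute("INSERT INTO person_import VALUES (?, ?, ?)",
                         (person_number, last_name, status))
        return True
    except sqlite3.DatabaseError:      # the trigger's exception reaches us here
        return False


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

ok_new = push_person(conn, 1001, "SMITH", "Employee")
ok_repeat = push_person(conn, 1001, "SMITH", "Employee")  # repeat load
ok_left = push_person(conn, 1001, "SMITH", None)          # status removed
ok_bad = push_person(conn, 1002, None, "Student")         # violates the rule

rows = conn.execute(
    "SELECT person_number, last_name, status FROM vendor_person").fetchall()
```

The caller never touches `vendor_person` directly, which is the isolation the import table is meant to provide.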


This approach has the additional advantage of
allowing us real time updates for applications that
need them. For example, we have a secure web page
that is used by our Human Resources department to
mark when a new employee has signed their I-9 form
(and is now allowed to start work). This web form
updates the person’s status, and immediately pushes
that change to the ID card system. By the time the new
employee has made it to the ID desk, they have
already been loaded in and can have their ID card
photo taken right away. This has made both the HR
staff and the ID desk staff happy: HR is happy
because they can now control when someone is issued
a staff ID card, and the ID desk staff is happy because
they don’t need to call HR to verify each new hire.


Not all changes need to happen in real time.
Many changes happen as the result of other automated
processes and batch jobs. We have a simple PL/SQL
program that uses the Get_Changes routine to find out
what has changed for a given queue, and then loads
the appropriate records into the import table. If the
target system is down, the changes will wait in the
queue until the next run. Since we are using the
process monitor to ensure that this happens, we know
when the scheduled job does not complete
successfully. In the new employee case, we have already
loaded the employee via the HR web page, but the
repeat load in the next batch run doesn’t hurt anything.

We can combine the use of the queuing support
described in the previous section with the insert
trigger based code to come up with a catch-up routine
like the one in Figure 4. This simply looks for changes
for the ‘Insite’ queue, and passes them to the Insite
system via the Push_Person routine (Figure 6).
Once we get to the end of the list, we record the fact
that we finished and terminate. This process is called
once a day by a cron job.

procedure Push_Queue(stopcount in number)
is
  R Meta_Change_Access.Rec;
begin
  loop
    R := Meta_Change_Access.Get_Changes('Insite', Null);
    exit when R.Tname is null;
    Push_Person(R.Person_Id);
    Meta_Change_Access.Ack_Change(R);
  end loop;
  Process_Monitor_Record.Mark_Proc('Insite-Push_People');
end Push_Queue;
Figure 4: Push_Queue procedure.

Manual Entries

When we bring a new system on line, it is
generally empty of our data. Rather than loading it via CSV
files or other bulk import tools, we use the Meta
Change Queue interface to load it up. In the cases
where there is a program calling the Get_Changes
routine directly, we simply manually insert records in the
queue for that service, and watch what happens. If we
like what we see, we write a simple script to load all
objects of interest into the queue. From that point on,
things run automatically, and the interface has been
well tested, as the entire system load has been
processed via the new interface. This also makes it easy
to reload if we decide to flush and start over.

In Figure 5, we have an example of a PL/SQL
script that will select all transfer students from the Fall
of 2002, and “refresh” their entries in any listener that
is interested in changes to the LOGINS table.⁶ You will
note that several of the parameters specified in the call
to Log_Update correspond with fields in the change
record (Table 1).

declare
  Cursor Lrec is
    Select Username, Source, owner, unixuid, rowid
      from logins where admit_cohort = 'TR200209';
begin
  for L in Lrec
  loop
    Meta_Change_Rtn.Log_Update('LOGINS', L.rowid, person_id => L.owner,
      pkey_string => L.username, pkey_number => L.unixuid);
  end loop;
end;
Figure 5: Manual refresh via queue.


In the cases where we use the import (trigger)
table, we generally have written a routine like
Insite_Interface.Push_Person (Figure 6) that will look up
the appropriate information and do the appropriate
insert (via SQL*Net). This routine can be called by
hand for testing, and later on via scripts to bulk load
the entire population. In Figure 7, we have an example
of a PL/SQL script that will load all current employees
and faculty into the INSITE (space management)
system via the Insite_Interface.Push_Person routine. This
routine calls routines in the Meta_Change_Data
package to get the data elements that are needed, and then
inserts them into the import table Simon_Person_Import
on the Insite machine using SQL*Net.


Tables and Listeners 


The second aspect of the project is how we
detect changes, queue them, and finally deliver those
changes in a timely manner.


Defining Tables 


The original concept was to track changes in a
particular database table, but in the actual
implementation, this proved to be limiting. Instead of looking at
the details of the source system’s tables, we looked at
the data requirements of the destination system. For
example, one system might just want general
information on a person such as name and status, while
another system would want that as well as directory
information. Since the transfer model was to give
each system a complete record of all desired information
about a person, we wanted a facility to pick and choose
which information about a person was sent. We therefore
defined the table to be the source of the primary key,
and added a sub type to indicate what about the base
object changed. For example, a telephone number
change would be marked with the PERSON table and the
Telephone sub type. We currently have 16 table and
sub type combinations defined (Table 2).

⁶ The query has been edited for space, but the concept is
still valid.


To detect these changes, we set up a database 
trigger (Figure 8) which records whenever a telephone 
number is changed. There are similar triggers to handle 
new telephone numbers (inserts) and deleted telephone 
numbers. Since this was done with a database trigger, 
we did not have to change any of the applications that 
had been previously developed to make changes. It 
also ensures that we don’t miss any changes. 


[Diagram: Source, Log Procedure, Interest Table, Change “Queue” Table]
Figure 9: Detecting changes with triggers.


While database triggers can be very handy for
integrating existing applications, they can sometimes
get complicated. We often have changes to a table that
are “housekeeping” in nature. Something in the table
changed, but that change is not of interest to any
downstream systems. With a trigger, you can be more
selective about which columns you watch for changes,
but that makes the trigger more complex. Triggers are
also challenging from a maintenance perspective, as
they are split conceptually between the table
definition (DDL), which is usually set at the start of the
project, and the interface code (PL/SQL packages).
New projects allow for closer integration of the
change queue requirements with the interface code.

Meta_Change_Data.Person(Person_Id, Lname, Fname, Mname,
    PFN, Rin, Iso, DOB, Gender, Ssn, Pidm);
Meta_Change_Data.Person_Department(Person_Id, Department, Division,
    Portfolio, Insite_Name, Orgn_Code);
Meta_Change_Data.Person_Status(Person_Id, Category, ID_Card);
Meta_Change_Data.Person_Directory(Person_Id, Title, Camp_Add,
    Camp_Phone, Camp_Fax, Mailstop);

Insert into OPSSINSITESYS.SIMON_PERSON_IMPORT@insite
    (Spriden_Id, Lastname, Firstname, Orgn_Code, Status, Title)
Values (Rin, upper(substr(lname,1,24)), upper(substr(nvl(pfn,fname),1,16)),
    Orgn_Code, upper(ID_Card), upper(Title));
Figure 6: Insite_Interface.Push_Person.

declare
  Cursor Emp_List is
    Select person_id, spriden_id, lastname
      from people
     where id_card_status in ('Employee','Faculty');
begin
  for R in Emp_List loop
    Insite_Interface.Push_Person(R.person_Id);
  end loop;
end;
Figure 7: Direct refresh (trigger table).


We recently installed a unified messaging
system.⁷ Although this system was supposed to use our
existing Exchange server (the one discussed in
Embracing and Extending Windows 2000 [4]), our
initial deployment required a second Exchange server,
and in fact, its own Windows 2000 domain. An
obvious step was to set up another listener in parallel to the
one used for our primary Windows 2000 domain.
However, along with the LOGINS information needed,
we also needed voice mail specific information.


⁷ Cisco Unity: voice messages and email are co-mingled
on an Exchange server, with access to both via both the
telephone and Outlook or other email agents.

Since this was a new project, we were able to
design the system so that all access to the “voice
mail” tables was via a single interface package. This
allowed us to call the Meta_Change_Rtn.Log_XXX
routines directly as needed. This gave us much greater
flexibility in what we send to the Unity system for
processing. For example, we have two “owners” for
many objects. We have the “Unity owner,” which
controls some access on the Unity system itself, and
the “System owner,” which controls administrative
access on the central database. For operational
reasons, these are often different entities. A voice mail tree
will be administratively owned by a department, while
on the Unity system, it will be “owned” by a group of
administrators. We often need to change the
administrative owner, but there is no need to send any changes
to Unity. By having the single interface, this can be
handled properly in the interface package.
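
The benefit of routing all access through one interface package can be sketched as follows. This is a hypothetical Python model, not the actual PL/SQL package: `VoiceMailTree`, `set_system_owner`, `set_unity_owner`, and the in-memory `change_log` are invented for illustration. The interface code, rather than a table-level trigger, decides which updates are worth queuing.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Stand-in for the change queue: (table name, primary key) per logged change.
change_log: List[Tuple[str, int]] = []


def log_update(tname: str, pkey_number: int) -> None:
    """Hypothetical equivalent of calling a Log_XXX routine directly."""
    change_log.append((tname, pkey_number))


@dataclass
class VoiceMailTree:
    tree_id: int
    unity_owner: str     # controls access on the Unity system
    system_owner: str    # controls administrative access centrally


def set_system_owner(tree: VoiceMailTree, owner: str) -> None:
    # Administrative change only: Unity does not care, so nothing is queued.
    tree.system_owner = owner


def set_unity_owner(tree: VoiceMailTree, owner: str) -> None:
    if tree.unity_owner == owner:
        return                         # nothing changed, nothing to send
    tree.unity_owner = owner
    log_update("UNITY_VMAIL", tree.tree_id)


tree = VoiceMailTree(tree_id=7, unity_owner="vm-admins",
                     system_owner="Physics")
set_system_owner(tree, "Chemistry")        # not of interest downstream
set_unity_owner(tree, "vm-admins")         # no actual change
set_unity_owner(tree, "helpdesk-admins")   # the only change Unity must see
```

A row-level trigger on the underlying table would have fired for all three calls; the interface package queues only the one that matters.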


In order to manage what tables are available to 
the listeners, we defined another database table, 
Meta_Change_Tables (Table 3) to hold that information. 
The primary purpose of this table is to document what 
is available. Most of this information is set when the 
table is defined, but one aspect is collected automati- 
cally. The first time a change record is logged for a 
specific Tname SubType pair, the PL/SQL call stack is 
saved to this table. This is a traceback of what proce- 
dures and packages called the logging routine. This 
can be very handy when tracing odd entries. This 
value will get refreshed if the Stack_Date value is 
cleared. This table also provides a handy selection list 
of possible tables when setting up listeners. 
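
The call stack bookkeeping can be imitated in Python with the standard `traceback` module. The `TableRegistry` class and the two caller functions below are hypothetical; the sketch assumes only what the text states: the stack is saved the first time a Tname and SubType pair is logged, and it is refreshed once the stack date has been cleared.

```python
import traceback
from datetime import datetime
from typing import Dict, Optional, Tuple

Key = Tuple[str, Optional[str]]


class TableRegistry:
    """Hypothetical stand-in for the Meta_Change_Tables bookkeeping."""

    def __init__(self) -> None:
        self._rows: Dict[Key, dict] = {}

    def note_log_call(self, tname: str, subtype: Optional[str]) -> None:
        row = self._rows.setdefault(
            (tname, subtype), {"call_stack": None, "stack_date": None})
        if row["stack_date"] is None:  # first call, or date was cleared
            row["call_stack"] = "".join(traceback.format_stack(limit=10))
            row["stack_date"] = datetime.now()

    def clear_stack_date(self, tname: str, subtype: Optional[str]) -> None:
        self._rows[(tname, subtype)]["stack_date"] = None

    def call_stack(self, tname: str, subtype: Optional[str]) -> str:
        return self._rows[(tname, subtype)]["call_stack"]


registry = TableRegistry()


def telephone_trigger() -> None:       # one caller of the logging routine
    registry.note_log_call("PERSON", "Telephone")


def batch_job() -> None:               # a different caller
    registry.note_log_call("PERSON", "Telephone")


telephone_trigger()
first = registry.call_stack("PERSON", "Telephone")

batch_job()                            # stack already recorded: unchanged
second = registry.call_stack("PERSON", "Telephone")

registry.clear_stack_date("PERSON", "Telephone")
batch_job()                            # date cleared: stack is refreshed
third = registry.call_stack("PERSON", "Telephone")
```

Capturing the stack once keeps the cost negligible while still answering "who is logging these entries?" when something odd appears.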


Defining Listeners 


It generally doesn’t do any good to talk if no one
is listening. There are three parts to each listener: an
entry in the Meta_Change_Listeners table (Table 4), a
listener specific interface package (such as the
Insite_Interface package mentioned previously) and the
actual interface application, be it an import table or a
custom application. As with Meta_Change_Tables, we
also record the call stack of whoever calls for this
listener. Although we have some concept of role based
access control built in for each queue, in all of our
deployments so far, we have written a specific interface
package which provides the access control we need.








Table         sub type   Description
BUILDINGS                Buildings (from INSITE Space Management)
DEPARTMENT               Departments from the phone directory
GROUPS                   Unix Groups
GROUP_MEM                Members of Unix Groups
INSITE_FLOOR             Floors within buildings (from Insite)
INSITE_SITE              Campuses (from Insite)
LOCATIONS                Rooms within Buildings (from Insite)                              11890
LOGINS                   Computer accounts (email)                                         61632
PERSON        Address    Address information                                              209016
PERSON        Dir_Orgn   Departmental affiliation from directory                             406
PERSON        Merge      Database cleanup (really ugly)                                      224
PERSON        PEOPLE     Basic person information, Name, DOB, ID Numbers                  160850
PERSON        Status     Current status for a person (Student, Employee, etc.)             95468
PERSON        Telephone  Telephone number (home, campus, etc.)                             78960
PERSON        UDI        User Directory Information: Class Year, web page, email address    5725
UNITY_VMAIL              Command for Unity Voice Messaging System.                          6072
Table 2: Tables and sub type.





Create or Replace Trigger Directory_Telephone_Trig_Upd
after update on Directory_Telephone for each row
begin
  Meta_Change_Rtn.Log_Update(tname => 'PERSON',
    subtype => 'Telephone', rrowid => :new.rowid,
    person_id => :new.Person_Id, Aux_String => :new.tele_type);
end Directory_Telephone_Trig_Upd;
Figure 8: Telephone change trigger.

We currently have seven listeners defined (Table
5). Of those, three are “real time,” polling for changes
frequently, and the others get once-a-day updates. In
addition, both BEST and CMMS have interactive tools
available to push through individual records on demand.


Linking Listeners with Tables 


The last part of the puzzle is the
Meta_Change_Interests table (Table 6), which defines which
table and subtype pairs any given listener is interested in.
This mapping is maintained with a web based tool, making it
very easy to maintain these relationships. This tool also
allows you to display pending and processed change
counts, flush pending records (handy during
development), and view the call stacks for tables and listeners.


When a call is made to one of the
Meta_Change_Rtn.Log_XXX routines, it takes the Tname
and Subtype parameters, looks for interested listeners in
the Meta_Change_Interests table (Table 6), and for each
one makes an entry in the Meta_Change_Queue table
(Table 7). The name of the listener is set in both the
Queue_Name and Proc_Name fields. When a record is
processed, the Queue_Name column is set to null. Because
this field is indexed and is cleared once the record has
been processed, calls to Get_Changes are quick and
efficient.


Field         Type            Description

TNAME         varchar2(32)    Primary Key(1). The table we are monitoring.
                              Matches Meta_Change_Queue.Tname.
SUBTYPE       varchar2(32)    Primary Key(2). An optional subtype of the table.
COMMENTS      varchar2(255)   A short description of what we are logging.
                              Intended to help developers.
CALL_STACK    varchar2(2000)  The formatted “call stack” that made a log entry.
                              This is set when Stack_Date is null.
STACK_DATE    date            The date when the latest call stack was recorded.
                              This will trigger refresh of the call stack data.
PERSON_ID     varchar2(65)    The source (if any) of the person_id value.
PKEY_STRING   varchar2(255)   The source of the pkey_string. This may be a
                              composite value.
PKEY_NUMBER   varchar2(255)   The source, if any, for the pkey_number. These
                              values are generally not person_id values.
AUX_STRING    varchar2(255)   The source of the aux_string. This may be a
                              composite value.

Table 3: Meta Change Tables definition.


Field         Type            Description

PROC_NAME     varchar2(8)     Primary Key(1). The name of the valid listener.
                              Used in the Get_Changes function call.
COMMENTS      varchar2(1024)  A description of what this listener is.
ROLE          varchar2(32)    An optional Oracle role needed to access this queue.
OWNER         number          The simon.people.id of who “owns” this queue.
CALL_STACK    varchar2(2000)  The formatted “call stack” that made a log entry.
                              This is set when Stack_Date is null.
STACK_DATE    date            The date when the latest call stack was recorded.
                              This will trigger refresh of the call stack data.

Table 4: Meta Change Listeners definition.


Listener   Frequency   Description

ADSI                   Active Directory, our primary Windows 2000 domain.
Applix                 The trouble ticketing system for the computer center.
BEST                   ID card and physical access control system.
CMMS                   Physical plant trouble ticket and payroll system.
Insite                 Space Management system (OFMS).
LDAP                   Directory service.
Unity                  Unified voice and email messaging system.

Table 5: Current listeners.


Field         Type            Description

PROC_NAME     varchar2(8)     The name of the listener.
TNAME         varchar2(32)    The table that the listener (Proc_Name) is
                              interested in.
SUBTYPE       varchar(32)     The subtype, if applicable.
COMMENTS      varchar2(1024)  Maybe a reason WHY it is interested.

Table 6: Meta Change Interests definition.
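
The fan-out and the "clear the indexed column when done" idea described in this section can be sketched in a few lines. The following is only an illustration, not the Simon PL/SQL code: it uses Python with SQLite as a stand-in for Oracle, the function names (make_db, log_change, get_changes, mark_processed) and the sample table and subtype values are assumptions, and only a handful of the columns from Tables 6 and 7 are modeled. Note that SQLite, unlike Oracle, also indexes NULL values, so the small-index benefit mentioned in the text is not reproduced here; the sketch shows only the control flow.

```python
import sqlite3

def make_db():
    """Create in-memory stand-ins for the interests and queue tables."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE meta_change_interests (
            proc_name TEXT, tname TEXT, subtype TEXT
        );
        CREATE TABLE meta_change_queue (
            entry_number INTEGER PRIMARY KEY AUTOINCREMENT,
            tname TEXT, subtype TEXT, change_type TEXT, pkey_string TEXT,
            queue_name TEXT,  -- listener name while pending, NULL once processed
            proc_name TEXT    -- listener name, kept for tracking
        );
        CREATE INDEX mcq_pending ON meta_change_queue (queue_name);
    """)
    return db

def log_change(db, tname, subtype, change_type, pkey_string):
    """Queue one entry per interested listener; return the number queued."""
    listeners = [row[0] for row in db.execute(
        "SELECT proc_name FROM meta_change_interests "
        "WHERE tname = ? AND subtype IS ?", (tname, subtype))]
    for name in listeners:
        db.execute(
            "INSERT INTO meta_change_queue "
            "(tname, subtype, change_type, pkey_string, queue_name, proc_name) "
            "VALUES (?, ?, ?, ?, ?, ?)",
            (tname, subtype, change_type, pkey_string, name, name))
    return len(listeners)

def get_changes(db, listener):
    """Return pending entries for one listener, oldest first."""
    return db.execute(
        "SELECT entry_number, tname, subtype, change_type, pkey_string "
        "FROM meta_change_queue WHERE queue_name = ? ORDER BY entry_number",
        (listener,)).fetchall()

def mark_processed(db, entry_number):
    """Clear queue_name so the entry no longer matches pending lookups."""
    db.execute("UPDATE meta_change_queue SET queue_name = NULL "
               "WHERE entry_number = ?", (entry_number,))
```

A listener would call get_changes with its own name, handle each row, and then call mark_processed; proc_name still records which listener received the entry after queue_name has been cleared.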


Conclusions 


At present, we have seven distinct “listeners”
waiting for changes in one or more of 16 defined tables
and subtypes. To date, this system has processed over
half a million changes. The three “real time” polling
processes do not appear to put any noticeable load on
the database. In fact, we have several other similar
polling processors handling password changes, and
they also do not noticeably load our database server.
The approach of using an index on a key column that is
cleared when the record has been processed works
very effectively, and we will continue to use it here
and with other processes, such as our password
synchronization for our “single signon.” We recently
modified our password processing (described in [4]) to
re-encrypt a password change for additional
authentication realms (LDAP, Kerberos version 5, and our
second Active Directory domain for the Unity voice mail
system).


The import table/trigger approach has been very 
handy in providing interactive response to some of our 
processes and will likely be our interface method of 
choice when dealing with new Oracle based vendor 
applications, as well as internally developed applica- 
tions where we want to maintain that clear demarca- 
tion between systems. 


Futures 


This Meta Change Queue system is fully opera- 
tional and well integrated with our environment. I don’t 
currently plan any major changes to it, but we will be 
making minor changes as new systems come along.


XML Output


Currently, each new listener requires a
listener-specific interface package to be written. One area
that may be worth exploring is a generic listener that
generates XML. This will most likely happen when we
get a new system that can accept an update stream in a
format like that. Given the existing examples and
support code, development of these interface packages
has not been a problem. They are generally pretty
simple and straightforward.


Field         Type            Description

TNAME         varchar2(32)    The name of the table (real or conceptual) that
                              has been changed.
SUBTYPE       varchar2(32)    The subtype of this table, if any.
RROWID        rowid           The rowid of the record that was changed. This
                              may be useful in speeding processing.
CHANGE_TYPE   varchar2(1)     The type of change: “I” for Insert, “C” for
                              Change, “D” for Deletion. Some indication of
                              what happened to the record.
PERSON_ID     number          The Simon.People.Id (if any). This is often a
                              primary key for Simon tables.
PKEY_STRING   varchar2(32)    A varchar2 primary key value, for tables that do
                              not use Person_Id as their key. This is optional.
PKEY_NUMBER   number          A numeric primary key value, similar to
                              Pkey_String, only numeric rather than varchar2.
AUX_STRING    varchar2(255)   An optional extra value that might be useful to
                              the receiving system. This might be the old name.
ENTRY_DATE    date            The sysdate value when this change entry was made.
ENTRY_NUMBER  number          An ever increasing sequence number. This can be
                              used to order changes.
PROC_DATE     date            The date when this record was processed and
                              could be cleared.
HOLD_UNTIL    date            The time and date when this record should again
                              be made available for processing. This can be
                              used by other systems that can’t process an
                              event now, but want to get it eventually. Some
                              other process will need to requeue these entries.
QUEUE_NAME    varchar2(8)     The name of the listener who is waiting for this
                              record. This is the trigger value for pending
                              entries. This column is indexed, and once a
                              record is processed, this should be set to null.
                              This will keep the index small and fast, allowing
                              for low overhead and frequent polls.
PROC_NAME     varchar2(8)     The name of the listener. Initially, it is the
                              same as the Queue_Name, but Queue_Name will be
                              cleared after processing; this helps us track
                              which listener got this record.
RETRY_COUNT   number          The number of times that this record was “put
                              back” by the listener. This can help identify
                              problem records and allow for back off options
                              using the hold_until feature.

Table 7: Meta Change Queue definition.
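
As a rough idea of what the generic XML listener discussed in this section might emit, the sketch below renders one queue record as an XML element. It is written in Python purely for illustration. The element names (a "change" root with one lower-cased child per column) and the function name change_to_xml are assumptions, not a format defined by the Simon system.

```python
import xml.etree.ElementTree as ET

def change_to_xml(record):
    """Render one queue record (a dict keyed by column name) as an XML string.

    Columns whose value is None are omitted, since most queue columns
    (subtype, person_id, aux_string, ...) are optional.
    """
    elem = ET.Element("change")
    for field, value in record.items():
        if value is None:
            continue
        child = ET.SubElement(elem, field.lower())
        child.text = str(value)
    return ET.tostring(elem, encoding="unicode")
```

A receiving system could then parse each element with any standard XML library, without needing a listener-specific package on the Oracle side.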


Status Reporting 


We now have listeners automatically tied into
our process monitoring system, which will report on 
overall system problems. However, we have not done 
much with record level feedback and error reporting. In 
general, problem records don’t get processed and cycle 
around for a while until someone notices them and takes 
appropriate action. This hasn’t been much of a problem, 
but is something we need to look at more closely. 
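
One way to keep problem records from cycling at full speed is to combine the Retry_Count and Hold_Until columns from Table 7 into a back-off policy. The sketch below is a hypothetical policy, not something the system is described as doing: the doubling schedule, the one-minute base delay, the 24-hour cap, the attention threshold, and the function names are all assumptions.

```python
from datetime import datetime, timedelta

MAX_DELAY = timedelta(hours=24)  # assumed upper bound on any single hold

def put_back(record, now, base_delay=timedelta(minutes=1)):
    """Return a copy of `record` with retry_count bumped and hold_until set.

    The delay doubles with each retry (1, 2, 4, ... minutes by default)
    and is capped at MAX_DELAY.
    """
    retries = record["retry_count"] + 1
    exponent = min(retries - 1, 16)  # bound the multiplier before capping
    delay = min(base_delay * (2 ** exponent), MAX_DELAY)
    updated = dict(record)
    updated["retry_count"] = retries
    updated["hold_until"] = now + delay
    return updated

def needs_attention(record, threshold=5):
    """Flag records that have been put back `threshold` or more times."""
    return record["retry_count"] >= threshold
```

As the Table 7 description notes, some other process would still have to requeue held entries once their Hold_Until time has passed.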


One of the objectives of my division is to
provide metrics for our activities. We are currently
logging some periodic summaries of changes, and more
formal analysis and reporting would be desirable.


Other Listeners 


New listeners are generally prompted by the
arrival of new systems, and as these systems generally
come from other divisions, it is not easy to predict what
will arrive or when. We do have some existing systems
that could benefit from the Meta Change Queue approach,
and we will be exploring these areas. Some of them
include:

• DNS configuration: providing end user tools
  for DNS changes, with immediate changes
  going via the MCQ.

• DHCP configuration: this has proven to be a
  “growth area,” as we need to implement ways of
  rapidly changing DHCP configuration as the result
  of virus scans, abuse investigations, and so on.


Bulk Priority Queue 


I will be adding a low priority queue that will
allow bulk entries to flow when “real time” requests 
are not pending. This has become an issue when mass 
create jobs “lock up” a listener for a long time and 
interactive users are trying to work. This change will be 
done entirely within the Get_Changes routine and none 
of the listeners will need to be changed. 
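
The selection rule described here (bulk entries flow only when no “real time” requests are pending) can be sketched as follows. This is an illustration in Python rather than the planned PL/SQL change. The "priority" field with its two values, the limit parameter, and the function name are assumptions; the paper does not say how priority will be recorded.

```python
def get_changes_prioritized(pending, listener, limit=10):
    """Return up to `limit` pending entries for `listener`.

    Real-time entries always win: bulk entries are returned only when
    the listener has no real-time entries waiting.
    """
    mine = [e for e in pending if e["queue_name"] == listener]
    realtime = [e for e in mine if e["priority"] == "realtime"]
    chosen = realtime if realtime else [e for e in mine if e["priority"] == "bulk"]
    return sorted(chosen, key=lambda e: e["entry_number"])[:limit]
```

Because the choice is made entirely inside the lookup routine, callers keep the same interface, which matches the stated goal that none of the listeners need to change.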


References and Availability 


Some of the examples in this paper have been
edited for publication; frequently, some of the error
handling code has been removed. While this should
not impact your understanding of how this works, if
you are going to implement something like this, I
would suggest looking at the actual source code to see
some of the special cases that we had to deal with.
Some are very site specific, but they will give you an
idea of the details we had to handle.


This project is part of (but not dependent on) the
Simon system, an Oracle based system used to assist
in the management of our computer accounts [4],
enterprise white pages [3], and printing configuration [2].
All source code for the Simon system is available on
the web. See http://www.rpi.edu/campus/rpi/simon/README.simon
for details. In addition, all of the Oracle table
definitions as well as PL/SQL package source are
available at
http://www.rpi.edu/campus/rpi/simon/misc/Tables/simon.Index.html .
Solaris Zones: Operating System Support for
Consolidating Commercial Workloads 


Daniel Price and Andrew Tucker — Sun Microsystems, Inc. 
ABSTRACT 


Server consolidation, which allows multiple workloads to run on the same system, has 
become increasingly important as a way to improve the utilization of computing resources and 
reduce costs. Consolidation is common in mainframe environments, where technology to support 
running multiple workloads and even multiple operating systems on the same hardware has been 
evolving since the late 1960’s. This technology is now becoming an important differentiator in the 
UNIX and Linux server market as well, both at the low end (virtual web hosting) and high end 
(traditional data center server consolidation). 


This paper introduces Solaris Zones (zones), a fully realized solution for server consolidation 
projects in a commercial UNIX operating system. By creating virtualized application execution 
environments within a single instance of the operating system, the facility strikes a unique balance 
between competing requirements. On the one hand, a system with multiple workloads needs to run 
those workloads in isolation, to ensure that applications can neither observe data from other 
applications nor affect their operation. It must also prevent applications from over-consuming 
system resources. On the other hand, the system as a whole has to be flexible, manageable, and 
observable, in order to reduce administrative costs and increase efficiency. By focusing on the 
support of multiple application environments rather than multiple operating system instances,
zones meets isolation requirements without sacrificing manageability.


Introduction 


Within many IT organizations, driving up system 
utilization (and saving money in the process) has 
become a priority. In the lean economic times follow- 
ing the post dot-com downturn, many IT managers are 
electing to adopt server consolidation as a way of life. 
They are trying to improve on typical data center 
server utilizations of 15-30% [1] while migrating to 
increasingly commoditized hardware. But the cost 
savings promised are not always realized [12]. Con- 
solidation can drive down initial equipment cost, but it 
can also increase complexity and recurring costs in 
several ways. In our experience, this has made many 
system administrators reluctant to embrace consolida- 
tion projects. We believe that when implemented 
effectively, consolidation can free system administra- 
tors and IT architects to pursue higher service levels, 
better overall performance, and other long term 
projects. With an appropriate solution, greater special- 
ization (and in turn, higher expertise) can be achieved; 
some administrators can focus on the maintenance of 
the physical platforms, and others can concentrate on 
the deployment of applications. 


Administrators currently lack an all-in-one solution 
for server consolidation, as existing solutions require 
administrators to purchase, author, or deploy additional 
software. This paper explains how a server consolidation 
facility tightly integrated with the core operating system 
can provide answers to these problems. 


Barriers to Consolidation 


Consolidation projects face a variety of technical 
problems. First and foremost, applications can be 
mutually incompatible when run on the same server.
In one real-world example, two poorly written appli- 
cations at a customer site both wanted to bind a net- 
work socket to port 80. While neither application was 
a substantial resource user, the customer resolved the 
conflict by buying two servers! Applications can also 
be uncooperative when administrators wish to run 
multiple instances of the same application on the same 
node. For example, dependencies on running as a par- 
ticular user ID can make it difficult to distinguish one 
running instance from another. Hard-coded log file 
locations or other pathnames can make it difficult to 
deploy two distinct versions of a particular application 
on the same node. At the highest level, solving this 
problem requires some form of namespace isolation, 
allowing administrators to make applications unaware 
of the presence of others. In the customer’s example, 
deploying to two separate OS instances running on 
two separate systems provides complete namespace 
isolation, but the cost is very high. 


A second technical problem faced by consolida- 
tors is security isolation. If multiple applications are to 
be deployed on a single host, what if there is a security 
bug in one of the applications? Even if each applica- 
tion is running under a different user ID (except for the 
applications that demand to run as root!), a wily intruder 
may be able to embark on a privilege escalation, in
which the intruder successively achieves higher levels of
privilege until the entire system is compromised. If
administrators are unable to assess the extent of the damage,
the consolidated system might require a rebuild. Ideally,
one should be able to create namespaces that are at
fundamentally reduced levels of privilege: root in such
an environment should be less powerful than the tradi- 
tional UNIX root. 


If a particular workload is compromised, a pro- 
tective mechanism that wards off denial of service and 
resource exhaustion attacks against the rest of the sys- 
tem should be in place. Similarly, consolidation 
projects must address quality of service guarantees 
and must often be able to account for resource utiliza- 
tion for billing or capacity planning purposes. Existing 
resource management solutions address many of these 
requirements by providing resource partitioning, 
advanced schedulers, and an assortment of resource 
caps, reservations, and controls. However, these facili- 
ties do not typically offer security and namespace iso- 
lation and in cases where both are available, they have 
not been closely integrated. 


A Comprehensive Solution 


While a range of solutions exists for each of the
problems described above, we discovered that no 
comprehensive consolidation facility was available as 
a core component of a commonly available operating 
system. We determined that deeper integration and a 
more “baked in” facility for consolidation would 
allow administrators to approach consolidation 
projects without the burden of designing the infra- 
structure to do so from component pieces. As a design 
goal, we established that administrators should need 
only a few minutes and a very few configuration 
choices to instantiate and start up a new application 
container, which we dubbed a zone. We also wanted 
our project to be a pure software solution that would 
work on a variety of hardware platforms, with the 
least possible performance tax. 


At the highest level, zones are lightweight 
“sandboxes” within an operating system instance, in 
which one or more applications may be installed and 
run without affecting or interacting with the rest of the 
system. They are available on every platform on
which Solaris 10 runs: AMD64, SPARC64,
UltraSPARC, and x86. Applications can be run within
zones with no changes, and with no significant impact
on the performance of either the application or the
base operating system.


Outline 


This paper introduces zones and explains how 
we built a server consolidation facility directly into a 
production operating system, Solaris 10. The next sec- 
tions describe related work, an overview of the facil- 
ity, our design principles, and the architectural compo- 
nents of the project. The paper then explores specific 
aspects of the zones implementation, including 
resource management, observability and performance. 
We also discuss experiences to date with the facility. 


Related Work 


Much of the previous work on support for server 
consolidation has involved running multiple operating 
system instances on a single system. This can be done
either by partitioning the physical hardware compo- 
nents into disjoint, isolated subsets of the overall sys- 
tem [5, 8], or by using virtual machine technologies to 
create abstracted versions of the underlying hardware 
[2, 7, 14]. Hardware partitioning, while providing a 
very high degree of application isolation, is costly to 
implement and is generally limited to high-end sys- 
tems. In addition, the granularity of resource alloca- 
tion is often poor, particularly in the area of CPU 
assignment. Virtual machine implementations can be 
much more granular in how resources are allocated 
(even time-sharing multiple virtual machines on a sin- 
gle CPU), but suffer significant performance over- 
heads. With either of these approaches, the cost of 
administering multiple operating system instances can 
be substantial. 


More recently, a number of projects have 
explored the idea of virtualizing the operating sys- 
tem’s application execution environment, rather than 
the physical hardware. Examples include the Jails 
facility in FreeBSD [9] and the VServer project avail- 
able for Linux systems [13]. These efforts differ from 
virtual machine implementations in that there is only 
one underlying operating system kernel, which is 
enhanced to provide increased isolation between 
groups of processes. The result is the ability to run 
multiple applications in isolation from each other 
within a single operating system instance. This should 
result in reduced administration costs, since there is 
only one operating system instance to administer 
(patch, backup, etc.); in addition, the performance 
overhead is generally minimal. Such technologies can 
also be used to create novel system architectures, such 
as the distributed network testbed provided by the 
PlanetLab project [3]. 


These technologies can be used as “toolkits” to 
assemble point solutions to virtualization problems, 
but at present they lack the comprehensive support 
required for supporting commercial workloads. The 
barrier to entry for administrators is also high due to 
the lack of tools and integration with the rest of the 
operating system. 


Zones Overview 


Zones provides a solution that virtualizes the
operating system’s application environment and
leverages the performance and sharing that a single
operating system instance makes possible. At the
same time, we have provided deeper and more com- 
plete system integration than is typical of such 
projects. We have been gratified when casual users 
mistake the technology for a virtual machine. This 
section provides a broad overview of the zones archi- 
tecture and operation. 


Figure 1 provides a block diagram of a system
with four zones, representing a hypothetical
consolidation. Zones red, neutral and lisa are non-global zones
running disjoint workloads. This example demon-
strates that different versions of the same application 
may be run without negative consequences in different 
zones to match the consolidation requirements. Each 
zone can provide a rich (and different) set of cus- 
tomized services, and to the outside world, it appears 
that four distinct systems are available. Each zone has 
a distinct root password and its own administrator. 


Basic process isolation is also demonstrated; a 
process in one non-global zone cannot locate, exam- 
ine, or signal a process in another zone. Each zone is 
given access to at least one logical network interface; 
applications running in distinct zones cannot observe 
the network traffic of the other zones even though 
their respective streams of packets travel through the 
same physical interface. Finally, each zone is provided 
a disjoint portion of the file system hierarchy, to which 
it is confined. 


The global zone encloses the three non-global 
zones and has visibility into and control over them. 
Practically speaking, the global zone is not different 
from a traditional UNIX system; root generally 
remains omnipotent and omniscient. The global zone 
always exists, and acts as the “default” zone in which 
all processes are run if no non-global zones have been 
created by the administrator. 


We use the term global administrator to denote a 
user with administrative privileges in the global zone. 
This user is assumed to have complete control of the
physical hardware comprising the system and the 
operating system instance. The term zone administra- 
tor is used to denote a user with administrative privi- 
leges who is confined to the sandbox provided by a 
particular non-global zone. 


Managing zones is not complicated. Figure 2
shows how to create a simple, non-networked zone
called lisa with a file system hierarchy rooted at
/aux0/lisa, install the zone, and boot it. Booting a zone
causes the init daemon for the zone to be launched. At
that point, the standard system services such as cron,
sendmail, and inetd are launched.


Design Principles 


This section and the next examine the zones 
architecture in greater depth; before doing so it helps 
to examine the design principles we applied. First and 
foremost, our solution must solve consolidation prob- 
lems such as those highlighted in the first section. The 
solution must provide namespace isolation and 
abstraction, security isolation, and resource allocation 
and management. 


Second, the facility must support commercial
applications: these are often scalable, threaded, and highly
connected to the network via TCP/IP, NFS, LDAP, etc.
These applications come with installers and usually
interoperate with the packaging subsystem on the host.
More importantly, because these applications are often
opaque in operation, they should work “out of the
box” within a zone whenever possible. Software
developers should not need to modify applications,
and administrators should not need to develop
scripting wrappers or have a deep understanding of UNIX
internals to deploy these applications. Similarly,
administrators interacting with this facility should be
pilots, not mechanics. As much as possible, system
administrators should be able to view the application
environment as a vehicle for deploying applications,
not as a collection of parts to assemble. Setup should
be simple and the entire system should look and feel
as much like a normal host as possible. In addition, the
solution should enable delegation wherever possible.
The administrator of the global zone should be able to
configure the overall system and delegate further
control to zone administrators.


[Figure 1: Zones block diagram. Visible labels: “lisa zone (lisa.usenix.org)”, “zone root: /aux0/lisa”.]


By exploiting sharing and semantics inside a sin- 
gle operating system instance, we can support a large 
number of application environments with relatively few 
resources. Operating in a shared environment means 
that monitoring application environments can be per- 
formed transparently. For example, from the pilot’s seat, 
we should immediately be able to tell which process on 
the system (regardless of the application environment in 
which it runs) is using the most CPU cycles. 


The solution must scale and perform with
the underlying platform. A 64-CPU application
environment should “just work,” as should the
deployment of 20 environments on a 1-CPU system.
Additionally, the solution should levy little or no
performance tax on applications run inside it. Finally,
minimal performance impact should be present on a
system with no application environments.


To address these design principles, we divided
the zones architecture into five principal components.

• A state model that describes the lifecycle of the
  zone, and the actions that comprise the transitions.
• A configuration engine, used by administrators
  to describe the future zone to the system. This
  allows the administrator to describe the
  “platform,” or those parameters of the zone that are
  controlled by the global administrator, in a
  persistent fashion.
• Installation support, which allows the files that
  make up the zone installation to be deployed
  into the zone path. This subsystem also enables
  patch deployment and upgrades from one
  operating system release to another.
• The application environment, the “sandbox” in
  which processes run. For example, in Figure 1
  each zone’s application environment is
  represented by the large shaded box.
• The virtual platform, comprising the set of
  platform resources dedicated to the zone.

We’ll explore these subsystems in more depth in 
subsequent sections. 


    # zonecfg -z lisa 'create; set zonepath=/aux0/lisa'
    # zoneadm list -vc
      ID NAME     STATUS       PATH
       0 global   running      /
       - lisa     configured   /aux0/lisa
    # zoneadm -z lisa install
    Constructing zone at /aux0/lisa/root
    Copying packages and creating contents file
    # zoneadm list -vc
      ID NAME     STATUS       PATH
       0 global   running      /
       - lisa     installed    /aux0/lisa
    # zoneadm -z lisa boot
    # zoneadm list -vc
      ID NAME     STATUS       PATH
       0 global   running      /
       7 lisa     running      /aux0/lisa
    # zlogin lisa
    [Connected to zone 'lisa' pts/7]
    zone: lisa
    # ptree
    1716  /sbin/init
      1769  /usr/sbin/cron
      1775  /usr/lib/sendmail -Ac -q15m
      1802  /usr/lib/ssh/sshd

Figure 2: Zones administration.


Zones State Model


A well-formed, observable state model that
describes the zone lifecycle is an important part of the
pilot model design principle; it makes it easier for
administrators to manage the zones present on the
system. Figure 3 illustrates the zone state model. While
this is of interest to the global administrator, zone
administrators need not be aware of these states. A
zone can be in one of four primary states, or in one of
several secondary, or transitional, states:
CONFIGURED: A zone’s configuration has been com- 
pletely specified and committed to stable storage. 

INSTALLED: Based on the zone’s configuration, a 
unique root file system for the zone has been 
instantiated on the system. 

READY: At this stage, the virtual platform for the zone 
has been established: the kernel has created the 
zsched process, network interfaces have been 
plumbed, file systems mounted, and devices con- 
figured. At this point, there are no user processes 
associated with the zone (zsched is a system 
process, and lacks a user address space). 

RUNNING: The init daemon has been created and 
appears to be running. init will in turn start the 
rest of the processes that comprise the applica- 
tion environment. 

SHUTTING DOWN: The zone is transitioned into this 
state when either the global or non-global zone 
administrator elects to reboot or halt the zone. 
The zone remains in this state until all user pro- 
cesses associated with the zone have been 
destroyed. 

DOWN: The zone remains in this state until the virtual 
platform has been completely destroyed: filesys- 
tems and NFS shares are unmounted, IPC objects 
destroyed, network interfaces unplumbed, etc. At 
that point the zone returns to the INSTALLED state. 


[Figure 3: Zones state model. States shown: Configured, Installed, Ready, Running. Transition labels: install, uninstall, ready, reboot, halt.]
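
The lifecycle described in the state list above can be written down as a small transition table. The sketch below is one reading of the prose, expressed in Python. It is not the Solaris implementation, and the event names ("install", "uninstall", "ready", "boot", "halt", "processes_gone", "platform_destroyed") are illustrative labels rather than zoneadm subcommands or kernel events.

```python
from enum import Enum

class ZoneState(Enum):
    CONFIGURED = "configured"
    INSTALLED = "installed"
    READY = "ready"
    RUNNING = "running"
    SHUTTING_DOWN = "shutting down"
    DOWN = "down"

# (current state, event) -> next state, following the state descriptions:
# four primary states plus the transitional SHUTTING_DOWN and DOWN states.
TRANSITIONS = {
    (ZoneState.CONFIGURED, "install"): ZoneState.INSTALLED,
    (ZoneState.INSTALLED, "uninstall"): ZoneState.CONFIGURED,
    (ZoneState.INSTALLED, "ready"): ZoneState.READY,
    (ZoneState.READY, "boot"): ZoneState.RUNNING,
    (ZoneState.RUNNING, "halt"): ZoneState.SHUTTING_DOWN,
    (ZoneState.SHUTTING_DOWN, "processes_gone"): ZoneState.DOWN,
    (ZoneState.DOWN, "platform_destroyed"): ZoneState.INSTALLED,
}

def advance(state, event):
    """Return the next state, or raise ValueError for an invalid transition."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"cannot apply {event!r} in state {state.name}") from None
```

Keeping the transitions in a single table makes invalid requests, such as booting a zone that has only been configured, easy to reject in one place.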


Configuration Engine 


Zones present a simple and concise configuration 
experience for system administrators. A command 
shell, zonecfg, is used by the global administrator to 
configure the zone’s properties and describe the zone 
to the system. The tool can be used in interactive 
mode or scripted to create a new zone or edit existing 
zones. The configuration includes information about 
the location of the zone in the file system, IP 
addresses, file systems, devices, and resource limits.
The zone configuration is retained by the system in a 
private repository (presently, an XML file), and keyed 
by zone name. 


The design of zonecfg was challenging: Zone 
configurations can be complex, but we wanted to 
instantiate a new zone with a minimum number of 
commands and without having to navigate through a 
complex configuration file. Ultimately, the only 
mandatory parameter is the zonepath — the location in 
the file system where the zone should be created. 
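
To show how little is needed for a scripted configuration, the sketch below builds the same zonecfg invocation that appears in Figure 2, with zonepath as the only parameter. It is a Python illustration only: the helper names zonecfg_create_argv and as_shell are assumptions, and the code merely constructs the command line without running zonecfg.

```python
import shlex

def zonecfg_create_argv(zone_name, zonepath):
    """Build the argument vector for a minimal zone configuration.

    Mirrors the command in Figure 2: create the zone and set its
    zonepath, the one mandatory parameter.
    """
    return ["zonecfg", "-z", zone_name, f"create; set zonepath={zonepath}"]

def as_shell(argv):
    """Quote an argument vector for display as a single shell command."""
    return " ".join(shlex.quote(arg) for arg in argv)
```

On a Solaris 10 system the resulting list could be handed to a process-launching call run by the global administrator; here it is only formatted for inspection.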


Installation Support 


The zones installer is an extension to the Solaris 
install and packaging tools. An important goal was to 
be able to create a zone on an existing system, without 
needing to consult installation media. Binary files 
such as /usr/bin/ls can simply be copied from the global
zone, or imported to the zone using a loopback mount 
to save disk space. 


Files which are customizable by an administra- 
tor, such as /etc/passwd, must not be copied from the 
global zone into the zone being installed. Such files 
must be restored to their “factory default” condition.
In order to accomplish this, the installer archives pri- 
vate, pristine copies of such volatile and editable sys- 
tem files when the global zone itself is installed or 
upgraded. The zones installer uses these archived ver- 
sions when populating zones. 


Because the zone installer is package-aware, the 
end result of zone installation is a virtual environment 
with an appropriately populated package database. 
This means that packaging utilities such as pkgadd can 
be used by the zone administrator to add or patch 
unbundled or third-party software inside the zone 
while also allowing the global administrator to
correctly upgrade and patch the system as a whole.


Application Environment 


The application environment forms the core of 
the zones implementation. Using the facilities it pro- 
vides, other subsystems such as NFS, TCP/IP, file sys- 
tems, etc. have been “virtualized,” that is, rearchi- 
tected to be compatible with the zones design. 


At the most basic level, the kernel identifies spe- 
cific zones in the same fashion as it does processes, by 
using a numeric ID. The zone ID is reflected in the
cred and proc structures associated with each process. 
The kernel can thus easily and cheaply determine the 
zone membership of a particular process. This map- 
ping is at the heart of the implementation. We have 
also found that virtualizing kernel subsystems (for 
example, process accounting) is often not terribly dif- 
ficult if the subsystem’s global variables are lifted up 
into a per-zone data structure (in other cases, such as 
TCP/IP, the virtualization required is more pervasive). 
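
The role of the per-process zone ID can be illustrated with a toy model of process visibility. The sketch below is Python rather than kernel code, and it is not how Solaris implements the check: the Proc class and the can_see and visible_processes helpers are assumptions. It encodes only the rules stated in this paper: a process in a non-global zone cannot locate processes in other zones, while the global zone has visibility into all of them (the global zone appears with ID 0 in Figure 2).

```python
from dataclasses import dataclass

GLOBAL_ZONE_ID = 0  # the global zone is listed with ID 0 in Figure 2

@dataclass(frozen=True)
class Proc:
    pid: int
    zone_id: int  # stands in for the zone ID kept in the proc/cred structures
    name: str

def can_see(observer, target):
    """A process sees its own zone; global-zone processes see every zone."""
    return observer.zone_id == GLOBAL_ZONE_ID or observer.zone_id == target.zone_id

def visible_processes(observer, table):
    """Filter a process table down to what `observer` may locate."""
    return [p for p in table if can_see(observer, p)]
```

Because membership is a single integer comparison, a check of this kind is cheap enough to apply wherever one process tries to locate, examine, or signal another.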


The process of booting the application environ- 
ment is similar to the late stages of booting the operat- 
ing system itself. In the kernel, a special process, 
zsched, is created. This mimics the traditional UNIX
process 0, sched. When seen from inside a zone, 
zsched is at the root of the process tree. zsched also 
acts as a container for a variety of per-zone data that is 
hard to express in other ways. RPC thread pools and 
other per-zone kernel threads, as well as resource con- 
trols and resource pool bindings, are handled in this 
fashion. Next, the init daemon is formed, associated 
with the zone, and exec’d to set it running in 
userspace; init then initiates the process of starting up 
other services that make the zone behave like a stand- 
alone computer system. 


Zones are also assigned unique identities. The 
zone name, which is used to label and identify the 
zone, is assigned by the global administrator. Control 
of the node name, RPC domain name, Kerberos con- 
figuration, locale, time zone, root password and name 
service configuration is entirely delegated to the zone 
administrator. When a zone is first booted, the zone 
administrator is stepped through the process of setting 
up this configuration via an interactive tool. 


Security concerns are central to the design of the
application environment. Fundamentally, a zone is less
powerful than the global environment, because zones
take advantage of the fine-grained privilege mechanism
available in Solaris 10 [11]. This mechanism
changes the traditional all-or-nothing “super-user”
privilege model into one with distinct privileges that
can be individually assigned to processes or users.¹ A
zone runs with a reduced set of privileges, and this
helps to ensure that even if a process could find a way
to escape namespace isolation enforced by the zone, it
would still be constrained from escalating to higher
privilege. For example, writing to /dev/kmem requires
all privileges. All non-global zone processes and their
descendants have fewer than all privileges, and are
constrained from ever achieving all privileges, so the
kernel will never allow such a process to write to
/dev/kmem. The namespace isolation facilities provided
by zones coupled with privilege containment provide a
sound double-hulled architecture for secure operation.
Although the set of privileges available in a zone is
currently fixed, we plan to make this configurable in
the future; this will allow administrators to create
special-purpose zones with only the minimal set of
privileges needed to run a particular service.


¹This is similar to the capability feature available in Linux.


One significant design challenge the project
faced was: how can we cross the boundary between
global and non-global zones in a safe fashion? We
authored the zlogin utility to allow global administrators
to descend into specific zones; this command is
modeled after familiar utilities such as rlogin. The
process of transferring a running process from one
zone to another is complex, and was a challenging
aspect of the implementation. We took care to prevent
any data from “leaking” from the global zone into
non-global zones; this required sanitization of parent
process IDs, process group IDs, session IDs, credentials,
fine-grained privileges, core file settings, and
other process model-related attributes. Processes
whose parent process lies outside the zone (as is the
case with zlogin to a zone) appear within the zone to
have zsched’s PID as their parent process ID. Similarly,
signals sent from the global zone to non-global
zone processes appear to originate from zsched.


Virtual Platform


The virtual platform is the “bottom half” of a
zone. Conceptually, it comprises the physical
resources that have been made available to the zone.
The virtual platform is also responsible for boot, reboot
and halt, and is managed by the zoneadmd daemon.


The virtual platform takes a snapshot of the zone
configuration from the configuration engine and follows
the plan it provides to bring the zone into the
READY state. This involves directing the kernel to create
the central zone_t data structure and the zsched kernel
process, setting up virtual network interfaces, populating
devices, creating the zone console, and directing
any other pre-boot housekeeping.


# zlogin -C lisa
[Connected to zone 'lisa' console]

lisa console login: root
Password:
# reboot
Aug 13 14:44:07 lisa reboot: rebooted by root
[NOTICE: Zone rebooting]

SunOS Release 5.10 Version s10_65 64-bit
Copyright 1983-2004 Sun Microsystems, Inc. All rights reserved.
Use is subject to license terms.
Hostname: lisa
NIS domain name is usenix.org

lisa console login:
[Connection to zone 'lisa' console closed]


Figure 4: Zones console.
Uniquely, the zone console can exist even before
the zone is in the ready state. This mimics a serial con- 
sole to a physical host, which can be connected even 
when a machine is halted, and it provides a familiar 
experience for administrators. The console itself is a 
STREAMS driver instance that is dynamically instantiated 
as needed. It shuttles console I/O back and forth from 
the zone (via /dev/console) to the global zone (via 
/dev/zcons/<zonename>/masterconsole). zoneadmd then 
acts as a console server to the zlogin -C command. Fig- 
ure 4 shows a typical console session. We found that 
the zone pseudo-console was a key to helping users see 
that a zone is a substantially complete environment, 
and perhaps more importantly, a familiar environment. 


Virtualization of Specific Subsystems 


One of the principal challenges of the zones 
project was making decisions about the “virtualization 
strategy” for each kernel subsystem. Generally, we 
sought to allow the global administrator to observe and 
control the entire system. But this was not always pos- 
sible due to API restrictions (for example, APIs dictated 
by a particular standard), implementation constraints, or 
other factors. The next sections detail the virtualization 
that was required for each primary kernel subsystem. 


Process Model 


One of the basic principles of zones is that pro- 
cesses in non-global zones should not be able to affect 
the activity of processes running within another zone. 
This also extends to visibility; processes within one 
(non-global) zone should not even be able to see pro- 
cesses outside that zone, and by extension should not 
be able to observe the activity of such processes. This 
is enforced by restricting the process ID space 
exposed through the /proc file system and process-spe- 
cific system calls such as kill, priocntl, and signal. If the 
calling process is running within a non-global zone, it 
will only be able to see or affect processes running 
within the same zone; applying the operations to 
process IDs in any other zone will return an error. The 
error code is the same as the one returned when the 
specified process does not exist, to avoid revealing the 
fact that the selected process ID exists in another zone. 
This policy also ensures that an application running in 
a zone sees a consistent view of system objects; there 
aren’t any objects that are visible through some means 
(e.g., when probing the process ID space using kill) but 
not others (e.g., /proc). 
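
The visibility rule described above can be sketched as a toy model. The process table, the zone_kill function, and the use of ID 0 for the global zone are assumptions made for illustration; only the behavior (a process in another zone is reported exactly like a nonexistent one) comes from the text. ESRCH is the standard error that kill reports for a nonexistent process.

```python
import errno

GLOBAL_ZONE = 0  # assumption for this model: ID 0 denotes the global zone

# Toy process table: pid -> ID of the zone that owns the process.
PROCESS_TABLE = {100: GLOBAL_ZONE, 200: 1, 300: 2}

def zone_kill(caller_zone: int, pid: int) -> int:
    """Return 0 on success, or an errno value standing in for the failure."""
    target_zone = PROCESS_TABLE.get(pid)
    if target_zone is None:
        return errno.ESRCH      # no such process anywhere on the system
    if caller_zone != GLOBAL_ZONE and target_zone != caller_zone:
        return errno.ESRCH      # exists in another zone, reported as nonexistent
    return 0                    # privilege checks are deliberately omitted here
```

A caller in zone 1 therefore cannot distinguish pid 300 (which lives in zone 2) from a pid that is not in use at all, which is what keeps the view through kill consistent with the view through /proc.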


The dual role of the global zone, acting as both
the default zone for the system and as the nexus of
system-wide administrative control, raises some interesting
issues. Since applications within the global zone have
access to processes and other system objects in other 
zones, the effect of administrative actions may be wider 
than expected. For example, service shutdown scripts 
often use pkill to signal processes of a given name to 
exit. When run from the global zone, all such processes 
in the system, regardless of zone, will be signaled. 
On the other hand, the system-wide scope is
often desired. For example, an administrator monitor- 
ing system-wide resource usage would want to look at 
process statistics for the whole system. A view of just 
global zone activity would miss relevant information 
from other zones in the system that may be sharing 
some or all of the system’s resources. Such a view is 
particularly important when the use of relevant system 
resources such as CPU, memory, swap, and I/O is not 
strictly partitioned between zones using resource man- 
agement facilities. 


We chose to allow any processes in the global 
zone to observe processes and other objects in non- 
global zones. This allows such processes to have sys- 
tem-wide observability. The ability to control or send 
signals to processes in other zones, however, is
restricted by a fine-grained privilege, PRIV_PROC_ZONE.
By default, only the root user in the global zone
is given this privilege. This ensures, for example, that 
user tucker, whose user ID in the global zone is 1234, 
cannot kill processes belonging to user dp, whose user 
ID in the lisa zone is also 1234. Because different 
zones on the same system can have completely differ- 
ent name service configurations, this is entirely possi- 
ble. The root user can also drop this privilege, restrict- 
ing activity in the global zone to affect only processes 
in that zone. 
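
The tucker/dp example can be expressed as a small policy check. This is a simplified sketch, not the kernel's actual logic: the Cred class and may_signal function are hypothetical, root is modeled simply as UID 0, and same-zone permission is reduced to a UID match. Only the privilege name PRIV_PROC_ZONE and the user IDs come from the text.

```python
from dataclasses import dataclass

GLOBAL_ZONE = 0  # assumption for this model

@dataclass(frozen=True)
class Cred:
    uid: int
    zone_id: int
    privs: frozenset = frozenset()

def may_signal(sender: Cred, target: Cred) -> bool:
    """Cross-zone signals need PRIV_PROC_ZONE and a global-zone sender;
    same-zone signals use a simplified UID match (UID 0 always passes)."""
    if sender.zone_id != target.zone_id:
        return sender.zone_id == GLOBAL_ZONE and "PRIV_PROC_ZONE" in sender.privs
    return sender.uid == 0 or sender.uid == target.uid

tucker = Cred(uid=1234, zone_id=GLOBAL_ZONE)
dp = Cred(uid=1234, zone_id=1)   # same numeric UID, but in the lisa zone
root = Cred(uid=0, zone_id=GLOBAL_ZONE, privs=frozenset({"PRIV_PROC_ZONE"}))
```

In this model, tucker cannot signal dp despite the matching UID, while root can; a root credential created without the privilege behaves like one that has dropped it.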


Accounting and Auditing 


Process and workload accounting provide an 
excellent example of both the challenges and opportu- 
nities for retrofitting virtualization into an existing 
subsystem. Accounting outputs a record of each 
process to a file upon its termination. The record typi- 
cally includes the process name, user ID, exit status, 
statistics about CPU usage, and other billing-related 
items. The UNIX System V accounting subsystem, 
which remains in wide usage, employs fixed size 
records that cannot be extended with new fields. Thus, 
we modified the system so that the accounting output
generated in any zone (including the global zone)
contains only records pertinent to the zone in which the
process executed. System V accounting can be 
enabled or disabled independently for each zone. 


In addition, since Solaris 8, the system has provided
a modernized “extended accounting” facility,
with flexible record sizes. We modified this so that 
records are now tagged with the zone name in which 
the process executed, and are written both to that 
zone’s accounting stream and to the global zone; this 
provides an important facility for consolidation, and, 
uniquely, the ability to account in detail for the activ- 
ity of the application environment. The set of data col- 
lected, the location of the accounting record files, and 
other accounting controls may all be configured inde- 
pendently per-zone. 
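
The dual delivery of extended accounting records can be sketched as follows. The streams dictionary, the write_exacct function, and the record fields are hypothetical; the sketch only models the behavior described above, in which each record is tagged with its zone name and written both to that zone's stream and to the global zone's.

```python
from collections import defaultdict

GLOBAL = "global"
streams = defaultdict(list)   # zone name -> records visible in that zone

def write_exacct(zonename: str, record: dict) -> None:
    """Tag the record with its zone, then append it to that zone's stream
    and, for non-global zones, to the global zone's stream as well."""
    tagged = dict(record, zonename=zonename)
    streams[zonename].append(tagged)
    if zonename != GLOBAL:
        streams[GLOBAL].append(tagged)

# Illustrative records; the field names are made up for this example.
write_exacct("lisa", {"command": "httpd", "cpu_sec": 1.5})
write_exacct(GLOBAL, {"command": "cron", "cpu_sec": 0.1})
```

The global stream ends up holding both records, each attributable to its zone, while the lisa stream holds only its own.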


The Solaris security auditing facility has been 
similarly updated with the addition of a zonename 
token. An audit record describes an event, such as
writing to a file, and the stream of audit records is 
written to disk and may be processed later. Each zone 
can access the appropriate subset of the audit trail, and 
the global zone can see all audit records for all zones. 
Because the global zone can track audit events by zone 
name, a complete record of auditable events can be 
generated per-zone. We think this represents an excit- 
ing possibility for intrusion detection and analysis. 


IPC Mechanisms 


Local inter-process communication (IPC) repre- 
sents a particular problem for zones, since processes in 
different (non-global) zones should normally only be 
able to communicate via network APIs, as would be 
the case with processes running on separate machines. 
It might be possible for a process in the global zone to 
construct a way for processes in other zones to com- 
municate, but this should not be possible without the 
participation of the global administrator. 


IPC mechanisms that use the file system as a ren- 
dezvous, such as pipes, STREAMS, UNIX domain 
sockets, doors, and POSIX IPC objects, fit naturally 
into the zone model without modification since pro- 
cesses in one zone will not have access to file system 
locations associated with other zones. Because the file 
system hierarchy is partitioned, there is no way for 
processes in a non-global zone to achieve rendezvous 
with another zone without the involvement of the 
global zone (which has access to the entire hierarchy). 


The System V IPC interfaces allow applications 
to create persistent objects (shared memory segments, 
semaphores, and message queues) for communication 
and synchronization between processes on the same 
system. The objects are dynamically assigned numeric 
identifiers that can be associated with user-defined 
keys, allowing usage of a single object in unrelated 
processes. Objects are also associated with an owner 
(based on the effective user ID of the creating process 
unless explicitly changed) and permission flags that 
can be set to restrict access when desired. In order to 
prevent sharing (intentional or unintentional) between 
processes in different zones, a zone ID is associated 
with each object, based on the zone in which the creat- 
ing process was running at time of creation. Non- 
global zone processes are only able to access or con- 
trol objects associated with the same zone. An admin- 
istrator in the global zone can still manage IPC objects 
throughout the system without having to enter each 
zone. The key namespace is also virtualized to be per- 
zone, which avoids the possibility of key collisions 
between zones. 
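
One way to picture the per-zone key namespace is a lookup table indexed by the pair (zone ID, key) rather than by the key alone. The following sketch assumes hypothetical ipc_get and ipc_list helpers and treats zone ID 0 as the global zone; it is not the System V IPC implementation.

```python
import itertools

_next_id = itertools.count(1)
_objects = {}   # (zone_id, key) -> object ID; the zone ID makes keys per-zone

def ipc_get(zone_id: int, key: int) -> int:
    """Return the object ID for `key` in the caller's zone, creating it on first use."""
    slot = (zone_id, key)
    if slot not in _objects:
        _objects[slot] = next(_next_id)
    return _objects[slot]

def ipc_list(caller_zone: int) -> list:
    """The global zone (ID 0 here) sees every object; other zones see only their own."""
    return sorted(obj for (zone, _key), obj in _objects.items()
                  if caller_zone == 0 or zone == caller_zone)
```

Two zones that happen to choose the same key, for instance 0x1234, receive distinct objects, so neither intentional nor accidental sharing can occur across the zone boundary.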


Networking 


As discussed earlier, each zone is configured 
with one or more IP addresses. For each address 
assigned, a logical network interface is created in the 
global zone when the zone is readied. This address is 
then assigned to the zone. The system as a whole 
looks like a traditional multi-homed server, but internally
the IP stack partitions the networking between
zones in much the same way as it would be partitioned
between separate servers. From the perspective of an
external network observer, a system with booted zones
appears to be a set of separate servers.


Each IP address and its associated logical inter- 
face are dedicated for use by the assigned zone. Only 
processes within the zone can send packets from that 
address or receive packets sent to that address. Logical 
interfaces can share a physical network interface, 
however, so depending on how the zones are config- 
ured, different zones may wind up sharing network 
bandwidth on a single physical interface. The isolation 
of network traffic means that services such as send- 
mail, Apache, etc., can be run in different zones without 
worrying about IP port conflicts. 
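
The absence of port conflicts follows from each zone binding only to its own address. A minimal model, assuming a hypothetical bind helper and made-up documentation-range addresses, might look like this:

```python
class PortInUse(Exception):
    """Raised when an (address, port) pair is already bound."""

# Hypothetical addresses from the documentation range, one per zone.
ZONE_ADDRS = {"red": "192.0.2.10", "lisa": "192.0.2.11"}

bound = {}   # (ip address, port) -> zone name

def bind(zone: str, port: int) -> tuple:
    """Bind `port` on the zone's own address; a zone cannot use another zone's address."""
    addr = ZONE_ADDRS[zone]
    if (addr, port) in bound:
        raise PortInUse(f"{addr}:{port} already bound by {bound[(addr, port)]}")
    bound[(addr, port)] = zone
    return addr, port
```

Both zones can bind port 25 because the underlying (address, port) pairs differ; a second bind of port 25 within the same zone still fails, as it would on a standalone host.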


Applications in different zones on the same sys- 
tem can communicate using conventional networking, 
just as applications on different systems can communi- 
cate. This traffic is “short-circuited” within the IP stack 
rather than sending data over the wire, minimizing the 
communication overhead. One drawback is that exist- 
ing firewalling products are not able to filter or other- 
wise act on cross-zone traffic, as it is handled entirely 
within IP and is not visible to any underlying firewalling
products. We hope to remedy this in the future.


Sending and receiving broadcast and multicast 
packets is supported within any zone. Inter-zone 
broadcast and multicast is implemented by replicating 
outgoing and incoming packets as necessary, so that 
each zone that should receive a broadcast packet or 
each zone that has joined a particular multicast group 
receives the appropriate data. 


Access to the network by non-global zones is 
restricted. The standard TCP and UDP transport inter- 
faces are available, but some lower level interfaces, 
such as raw socket access (which allows the creation 
of IP packets with arbitrary contents) and DLPI are 
not. These restrictions are in place to ensure that a 
zone cannot gain uncontrolled access to the network, 
where it might be able to behave in undesirable ways. 
For example, a zone cannot masquerade as a different 
zone or host on the network. Access to ICMP is also 
supported, allowing popular utilities such as traceroute 
and ping to work properly. 

The zones facility also provides support for man- 
ual configuration of IPv6 addresses, with support for 
automatic addressing planned for the future. Because 
much of the TCP/IP infrastructure is shared between 
all zones, some functionality is automatically sup- 
ported and can be configured on behalf of a zone by 
the global administrator. For example, if IP Multi- 
pathing is configured within the global zone, the logi- 
cal interfaces associated with a failed physical inter- 
face are automatically moved to a configured alternate 
interface. The individual zones do not need any 
configuration to support this, and are not even aware
of the failure. 


IPsec and IPQoS facilities can be configured on 
behalf of a zone by the global administrator; in the 
future we hope to allow global administrators to dele- 
gate some of this configuration to non-global zones. It 
would also be convenient to provide DHCP client sup- 
port so that the global zone could request IP addresses 
for non-global zones from a DHCP server, and work 
to incorporate this support is underway. 


File Systems 


We have seen that zones are rooted at a particular 
point in the file system. This is implemented in a fash- 
ion similar to the chroot system call, although that 
call’s well known security limitations [6] are avoided 
and the zone is not escapable. Because a different 
mechanism is used, use of chroot is even possible 
within a zone. 


When the zone boots, the configuration engine is 
consulted for a list of file systems to mount on behalf 
of the zone. These can include storage-backed file sys- 
tems as well as pseudo-file systems. In particular, lofs, 
the Solaris loopback file system, provides a useful tool 
for constructing a file system namespace for a zone. It 
can be used to mount segments of a file system in 
multiple places within the namespace. For example, 
the /usr file system is typically loopback mounted 
read-only beneath the zone root. This results in a high 
degree of sharing of storage, and a freshly installed 
zone requires only about 60 MB of disk space. The 
use of loopback mounts also results in the sharing of 
process text pages in the buffer cache, further decreas- 
ing the impact of running large numbers of zones. 
However, this approach adds substantial complexity to 
the design and implementation of the packaging tools. 
For example, the zone installation software must be 
aware that a particular file system object such as 
/usr/bin/ls will be available, but it will not have to be
copied to the zone’s /usr file system. 


Mounts require special handling as well. In 
Solaris /etc/mnttab is itself a mounted file system that, 
when mounted, exports the typical /etc/mnttab file. The 
mnttab handling code was modified so that each zone 
sees only the mounts accessible by it. As usual, the 
global zone can see everything. 
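
The per-zone view of the mount table can be approximated by filtering on the zone's root path. This sketch makes two assumptions that the text does not state: that the zone roots live under hypothetical paths such as /zones/lisa/root, and that the zone root prefix is stripped so that paths appear rooted at "/" inside the zone.

```python
def visible_mounts(all_mounts: list, zone_root: str) -> list:
    """Return mount points as seen from a zone rooted at `zone_root`.
    The global zone is modeled by zone_root == "/" and sees every mount."""
    if zone_root == "/":
        return list(all_mounts)
    root = zone_root.rstrip("/")
    visible = []
    for mnt in all_mounts:
        if mnt == root or mnt.startswith(root + "/"):
            # Strip the zone root prefix so the path looks rooted at "/".
            visible.append(mnt[len(root):] or "/")
    return visible

# Hypothetical system-wide mount list.
MOUNTS = ["/", "/usr", "/zones/lisa/root", "/zones/lisa/root/usr", "/zones/red/root"]
```

With this list, the lisa zone would see only "/" and "/usr", while the global zone would see all five entries.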


A key security principle is that the global zone 
users should not be able to traverse the file system 
hierarchy of non-global zones. Allowing this would 
enable unprivileged users in the global zone to collabo- 
rate with root users in non-global zones. For example, a 
zone’s root user might mark a binary in a zone setuid 
root, and collaborate with a non-root user in the global 
zone, who could then run the binary and gain superuser 
privileges. As such, the zones infrastructure
enforces that the zone root’s parent directory be owned, 
readable, writable, and executable by root only. We 
were also careful to prevent zones components such as 
zlogin and zoneadmd from ever accessing files residing 
within zones, in order to avoid “traps” that might have
been placed by privileged software within the zone. 


Devices 


A limited set of devices, accessible via the /dev 
hierarchy, are available to zones by default. Addition- 
ally, some devices required additional virtualization to 
support zones. The syslog device is a good example: 
each zone has a distinct /dev/log device with a separate 
message stream, so that syslog(3C) messages are deliv- 
ered to the syslogd in the zone that generated them. 


Administrators can use the configuration engine 
to add additional devices to the zone as needed. This 
carries additional security risk because device inter- 
faces are relatively unconstrained. A single device 
driver can form its own subsystem of APIs and seman- 
tics. For example, writing to a disk, writing to /dev/null, 
and writing to /dev/kmem all have completely different 
effects and security implications. As a general princi- 
ple, we discourage the placement of physical devices 
into zones, as there is wide opportunity for mischief. 
For example, disks or disk partitions can be assigned 
to a zone, but the preferred method is for the global 
administrator to assign only file systems, which pro- 
vide more uniform, auditable semantics. 


A driver bug or improperly guarded feature could 
allow a hacker to attack the kernel. As a result, all of 
the devices included in a zone by default were audited 
and tested for security problems. We also addressed 
more systemic security problems; for example, an 
imported device node may allow a hacker to attack the 
system. For this reason, zones lack the privilege to call 
mknod(2) to create device nodes. However, this prob- 
lem is more pervasive. If a hacker caused an NFS 
server to export a device node that matched the major 
and minor number of /dev/kmem and caused the zone 
to mount this share, then the system could be compro- 
mised. To defend against this attack, all mounts initi- 
ated from within a zone are guarded by the nodevices 
mount option, which prevents the opening of device 
nodes present on the mount. Note that even without 
nodevices, such an attack would remain difficult, as the 
reduced privilege allotted to the zone does not allow 
writing to the kmem device under any circumstances. 


A final category of attacks could be carried out 
against the software managing the /dev hierarchy that 
runs in the global zone as part of the virtual platform. 
In this case, both global and non-global zones require 
access to the /dev hierarchy. The solution is to build 
and manage /dev for the zone outside of the zone’s file 
system hierarchy, and then use the lofs file system to 
loopback mount /dev into the zone. Additionally, the 
kernel prohibits the zone from making all but the most 
basic modifications to its /dev hierarchy. Permission, 
group, and ownership changes are permitted; other file 
system operations are not. 


NFS 


Virtualizing client-side NFS support presents a 
somewhat unique challenge. NFS is not only a file 
system; it also has semantics that are dependent on the
network identity (hostname, RPC domain, etc.) of the 
client. For example, an NFS share may be exported 
solely to a client with a specific host name. Since each 
zone has a separate network identity, NFS mounts in 
different zones on the same system must be handled 
separately. In particular, operations to file system 
mounts associated with a zone must have matching 
credentials. This allows lower-level code (such as the 
RPC transport code) to keep track of the zone associ- 
ated with a specific operation, even if that operation is 
being performed asynchronously. As a consequence, 
NFS mounts in a non-global zone cannot be accessed 
from the global zone. 


Another complication is the use of kernel 
threads. The Solaris NFS implementation maintains a 
pool of in-kernel threads to asynchronously read- 
ahead data before it is needed, which improves perfor- 
mance when large files are read sequentially. When 
multiple zones can be using NFS, the thread pools 
need to be maintained on a per-zone basis. This allows 
the number of threads in each pool to be managed 
independently (since different zones may have differ- 
ent requirements with respect to concurrency) and 
means that threads can be assigned credentials associ- 
ated with the appropriate zone. 


Resource Management 


Most of the prior discussion has described the 
ways in which zones can be used to isolate applica- 
tions in terms of configuration, namespace, security, 
and administration. Another important aspect of isola- 
tion is ensuring that each application receives an 
appropriate proportion of the system resources: CPU, 
memory, and swap space. Without such a capability, 
one application can either intentionally or unintention- 
ally starve other applications of resources. In addition, 
there may be reasons to prioritize some applications 
over others, or adjust resources depending on dynamic 
conditions. For example, a financial company might 
wish to give a stock trading application high priority 
while the trading floor is open, even if it means taking 
resources away from an application analyzing overall 
market trends. 


The zones facility is tightly integrated with exist- 
ing resource management controls available in Solaris 
[10]. These controls come in three flavors: entitlements,
which ensure a minimum level of service; limits,
which bound resource consumption; and partitions,
which allow physical resources to be exclusively
dedicated to specific consumers. Each of these
types of controls can be applied to zones. For exam- 
ple, a fair-share CPU scheduler can be configured to 
guarantee a certain share of CPU capacity for a zone. 
In addition, an administrator within a zone can config- 
ure CPU shares for individual applications running 
within that zone; these shares are used to determine 
how to carve up the portion of CPU allocated to the 
zone. Likewise, resource limits can be established on
either a per-zone basis (limiting the consumption of 
the entire zone) or a more granular basis (individual 
applications or users within the zone). In each case, 
the global zone administrator is responsible for config- 
uring per-zone resource controls and limits, while the 
administrator of a particular non-global zone can con- 
figure resource controls within that zone. 


Figure 5: Zones and the fair-share scheduler.


Figure 5 shows how the fair-share CPU sched- 
uler can be used to divide CPU resources between 
zones. In the figure, the system is divided into four 
zones, each of which is assigned a certain number of 
CPU shares. If all four zones contain processes that 
are actively using the CPU, then the CPU will be 
divided according to the shares; that is, the red zone 
will receive 1/7 of the CPU (since there are a total of 
seven shares outstanding), the neutral zone will receive 
2/7, etc. In addition, the lisa zone has been further
subdivided into five projects, each of which represents
a workload running within that zone. The 2/7 of the 
CPU assigned to the lisa zone (based on the per-zone 
shares) will be further subdivided among the projects 
within that zone according to the specified shares. 
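
The share arithmetic can be worked through with exact fractions. In the sketch below, the shares for red (1), neutral (2), and lisa (2) follow the text; the fourth zone's name and its 2 shares are inferred from the stated total of seven, and the five project names and their shares inside lisa are entirely hypothetical, since the text gives no values for them.

```python
from fractions import Fraction

def split(shares: dict, total: Fraction = Fraction(1)) -> dict:
    """Divide `total` among active entities in proportion to their shares."""
    outstanding = sum(shares.values())
    return {name: total * Fraction(n, outstanding) for name, n in shares.items()}

# Per-zone shares; "global" with 2 shares is inferred, not stated in the text.
zones = split({"global": 2, "red": 1, "neutral": 2, "lisa": 2})

# Hypothetical per-project shares within the lisa zone.
projects = split({"p1": 1, "p2": 1, "p3": 2, "p4": 3, "p5": 3},
                 total=zones["lisa"])
```

The projects always sum to the zone's own allocation, so an administrator inside lisa can rebalance projects without affecting any other zone's portion of the CPU.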


Resource partitioning is supported through a 
mechanism called resource pools, which allows an 
administrator to specify a collection of resources that 
will be exclusively used by some set of processes. 
Although the only resources initially supported are 
CPUs, this is planned to later encompass other system 
resources such as physical memory and swap space. A 
zone can be “bound” to a resource pool, which means 
that the zone will run only on the resources associated 
with the pool. Unlike the resource entitlements and 
limits described above, this allows applications in dif- 
ferent zones to be completely isolated in terms of 
resource usage; the activity within one zone will have 
no effect on other zones. This isolation is furthered by 
restricting the resource visibility. Applications or users 
running within a zone bound to a pool will see only 
resources associated with that pool. For example, a 
command that lists the processors on the system will 
list only the ones belonging to the pool to which the 
zone is bound. Note that the mapping of zones to 
pools can be one-to-one, or many-to-one; in the latter 
case, multiple zones share the resources of the pool, 
and features like the fair-share scheduler can be used
to control the manner in which they are shared. 


Figure 6 shows the use of the resource pool facil- 
ity to partition CPUs among zones. Note that pro- 
cesses in the global zone can actually be bound to 
more than one pool; this is a special case, and allows 
the use of resource pools to partition workloads even 
without zones. Non-global zones, however, can be 
bound to only one pool (that is, all processes within a 
non-global zone must be bound to the same pool). 
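
The relationship between zones, pools, and visible CPUs can be sketched with two lookup tables. The pool names, CPU numbers, and zone bindings below are hypothetical and are not taken from Figure 6; the sketch only models the rules that a non-global zone is bound to exactly one pool, that several zones may share a pool, and that a zone sees only its pool's CPUs.

```python
# Hypothetical pools, each owning a disjoint set of CPU IDs.
POOL_CPUS = {
    "pool_default": [0, 1],
    "pool_web": [2, 3],
    "pool_db": [4, 5, 6, 7],
}

# Each non-global zone maps to exactly one pool; many-to-one is allowed.
ZONE_POOL = {"red": "pool_web", "neutral": "pool_web", "lisa": "pool_db"}

def visible_cpus(zone: str) -> list:
    """CPUs that a processor-listing command would report inside `zone`."""
    return POOL_CPUS[ZONE_POOL[zone]]
```

Here red and neutral share two CPUs (and could be balanced against each other with the fair-share scheduler), while lisa runs on four CPUs that no other zone can consume.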


Performance and Observability 


As noted in the related work section, one of the 
advantages of technologies like zones that virtualize 
the operating system environment over a traditional 
virtual machine implementation is the minimal perfor- 
mance overhead. In order to substantiate this, we have 
measured the performance of a variety of workloads 
when running in a non-global zone, when compared to 
the same workloads running without zones (or in the 
global zone). This data is shown in Figure 7 (in each 
case, higher numbers represent a faster run). The final
column expresses the zone result as a percentage of the
base result, so values below 100 indicate degradation
and values above 100 indicate improvement relative to the
global zone. As can be seen, the impact of running an
application in a zone is minimal. The 4% degradation 
in the time-sharing workload is primarily due to the 
overhead associated with accessing commands and 
libraries through the lofs file system. 


Workload         Base        Zone        Diff (%)
Java             38.45       38.29       99.6
Time-sharing     23332.58    22406.51    96.0
Networking       283.30      284.24      100.3
Database         38767.62    37928.70    97.8


Figure 7: Performance impact of running in a zone. 
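
The final column of Figure 7 can be reproduced from the other two. The sketch below assumes the column is the zone result divided by the base result, expressed as a percentage and rounded to one decimal place; the function name is hypothetical, and the numbers are taken from the figure.

```python
# (base, zone) results per workload, as listed in Figure 7.
RESULTS = {
    "Java": (38.45, 38.29),
    "Time-sharing": (23332.58, 22406.51),
    "Networking": (283.30, 284.24),
    "Database": (38767.62, 37928.70),
}

def zone_vs_base(base: float, zone: float) -> float:
    """Zone result as a percentage of the base result, to one decimal place."""
    return round(zone / base * 100, 1)

for workload, (base, zone) in RESULTS.items():
    print(f"{workload:<13}{zone_vs_base(base, zone):6.1f}")
```

Under that assumption the computed values agree with the figure, and the time-sharing entry of 96.0 corresponds to the 4% degradation discussed in the text.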


We also measured the performance of running multi- 
ple applications on the system at the same time in dif- 
ferent zones, partitioning CPUs either with resource 
pools or the fair-share scheduler. In each case, the
performance when using zones was equivalent to, and in
some cases better than, the performance when running
each application on separate systems.


Since all zones on a system are part of the same 
operating system instance, processes in different zones 
can actually share virtual memory pages. This is partic- 
ularly true for text pages, which are rarely modified. 
For example, although each zone has its own init 
process, each of those processes can share a single copy 
of the text for the executable, libraries, etc. This can
result in substantial memory savings for commonly 
used executables and libraries such as libc. Similarly,
other parts of the operating system infrastructure, such 
as the directory name lookup cache (or DNLC), can be 
shared between zones in order to minimize overheads. 


Observability Tools and Debugging 


Because of the transparent nature of zones, all of 
the traditional Solaris /proc tools may be applied to 
processes running inside of zones, both from inside 
the non-global zone, and from the global zone. Addi- 
tionally, numerous utilities such as ps, priocntl, ipcs, 
and prstat (shown in Figure 9) have been enhanced for 
zone-awareness. 


In addition, we were able to enhance the DTrace 
[4] facility to provide zone context. In the following 
example, we can easily discover which zone is causing 
the most page faults to occur; see Figure 8.


We were pleasantly surprised when a customer 
pointed out to us that he could employ zones and 
DTrace together to better understand and debug a 
three-tiered architecture by deploying the tiers 
together on a single host in separate zones in the test 
environment, and making specific queries using 
DTrace. 


Experience 


Zones is an integrated part of the Solaris 10 oper- 
ating system, which is still under development. 
Through pre-release programs, Zones has seen adop- 
tion both within Sun and by a variety of customers. 


In one ‘pilot’ deployment, Sun’s IT organization 
has consolidated a variety of business applications. A 
four-CPU server with six non-global zones is hosting: 

• Zone 1: The web front-end (Java System Web
  Server version 6.1) to Sun’s host database.

• Zone 2: The web front-end (Java System Web
  Server version 6.0) to the ‘orgtool’ website,
  providing Sun’s online organization chart.

• Zone 3: The Oracle database that provides the
  backend for Sun’s online organization chart.

• Zone 4: A database reporting tool, which interfaces
  with Peoplesoft and corporate tax databases;
  this is monitored by software from TeamQuest.

• Zone 5: A security hardened CVS server, using
  LDAP and DNS name services (the other zones
  use NIS).

• Zone 6: A Sun-internal application that utilizes
  Apache and MySQL.


# dtrace -n 'vminfo:::as_fault{@[zonename]=count()}'
dtrace: description 'vminfo:::as_fault' matched 1 probe
^C
  global      4303
  lisa       29867


Figure 8: Enhanced DTrace facility with zone context.


This consolidation is probably typical of both
large and small IT organizations; a wide variety of
heterogeneous software (including different versions
of the same application) is in play. In order to provide
more predictable quality of service, the deployment
team assigned different amounts of CPU shares to the
various zones, to represent the relative importance of
each workload.
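

CPU shares under the fair share scheduler are relative weights: when the CPUs are fully contended, each busy zone is entitled to its own shares divided by the total shares of all busy zones. The following minimal Python sketch illustrates that arithmetic. The zone names and share values are hypothetical, since the text does not give the values used in the pilot, and cpu_entitlements is our own helper, not a Solaris interface.

```python
def cpu_entitlements(shares, active=None):
    """Return each active zone's CPU fraction under full contention.

    shares: mapping of zone name -> number of CPU shares
    active: zones currently competing for CPU (default: all zones)
    """
    if active is None:
        active = set(shares)
    total = sum(shares[zone] for zone in active)
    if total <= 0:
        raise ValueError("active zones must hold at least one share in total")
    return {zone: shares[zone] / total for zone in active}


if __name__ == "__main__":
    # Hypothetical share assignment, for illustration only.
    shares = {"web": 20, "db": 30, "reports": 10}
    print(cpu_entitlements(shares))
    # With "reports" idle, its unused entitlement is split between the others.
    print(cpu_entitlements(shares, active={"web", "db"}))
```

Because only relative weights matter, an idle zone does not reserve CPU, and the remaining zones divide the machine in proportion to their shares.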


[Diagram: the zones global, red, neutral, and lisa, the resource
pools whirl, tide, and default, and the CPUs assigned to each pool.]


Figure 6: Zones and resource pools.


Security Experience 


Any new systems architecture is rightfully viewed
with suspicion by security-conscious administrators;
this was true during the project’s development inside 
Sun. In order to better understand the security environ- 
ment in which zones would need to operate, we created 
a non-global zone on an otherwise locked-down sys- 
tem. We then created a /SECRET file in the global zone 
and distributed the root password to the non-global 
zone far and wide within Sun, creating a “zones hack- 
ing” contest. This was extremely successful both for 
the contestants and the zones development team. 


The system was compromised in the first few 
hours, using an exploit that we knew existed, but had 
considered very obscure. We realized that we had 
underestimated our adversaries. As we corrected the 
security problems our hackers found, we learned a lot 
about the sorts of attack techniques and vectors to 
expect. A positive result was that the reduced
privileges associated with zone processes meant that
attackers who managed to read the /SECRET file were
usually unable to perform other sorts of mischief such
as writing to /dev/kmem. We responded to the attacks
by adding new system-level protections that prevent
all of the exploits found. For example, mounts performed
by a zone transparently have the nodevices
mount option applied. This prevents using imported
device files (for example, from an NFS share) as an
attack vector.


Other Applications and Future Directions


In the course of developing this facility we considered
the many other situations in which technologies
such as Jails have been deployed. While the primary
focus of the design is server consolidation, zones
are well-suited for application developers, and may
help organizations with large internal software
development efforts to provide a multitude of “test
systems.” Many customers we have encountered spend
substantial sums buying servers solely for this purpose.


Zones are also a useful solution for web hosting
and other Internet-facing applications, in which creating
a large number of application environments (perhaps
administered by different departments) on modest
hardware is important. We are also hopeful that
advanced networking architectures such as PlanetLab
will eventually include support for zones. At Sun,
work is underway to prototype a version of Trusted
Solaris based on the isolation provided by zones. We
expect other novel uses for zones will emerge as
researchers, developers, and administrators adopt them.


Moving forward, we know that networking poses
key challenges to zones; groups of zones will cooperate
in multi-tier architectures, and administrators will
expect to be able to cluster, migrate, and failover zones
from one host to another. Today these technologies are
the unique domain of virtual machine solutions.


$ prstat -Z 10
  PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP
12008 60028     191M  167M cpu18    1    0   0:00:31 1.1% ns-httpd/75
28163 root       17M   10M sleep   59    0   4:40:37 0.5% ns-httpd/2
12047 70002     296M  270M sleep   59    0   0:00:06 0.4% oracle/1
10485 101       190M  101M sleep   59    0   1:37:20 0.2% webservd/82
14058 root     6928K 5072K sleep   59    0   0:00:00 0.2% sshd/1
 1098 root     1736K  856K sleep   59    0   0:33:00 0.0% tqrtap.v9/1
  994 root     6848K 5512K sleep   59    0   0:23:08 0.0% tqwarp.ext/1
12049 70002     296M  270M sleep    1    0   0:00:03 0.0% oracle/1
  804 root     4096K 3616K sleep   59    0   0:00:25 0.0% nscd/51
ZONEID    NPROC  SIZE   RSS MEMORY      TIME  CPU ZONE
     2       39  374M  272M   1.6%   4:45:01 2.1% lisa
     1       55 8025M 7217M    45%   0:05:20 0.9% red
     0       56  212M  130M   0.7%   2:28:18 0.2% global
     3       36  463M  211M   1.3%   1:48:55 0.2% neutral
     6       47  940M  372M   2.2%   0:24:52 0.0% euro
     5       38  330M  246M   1.5%   0:10:47 0.0% end
Total: 261 processes, 1356 lwps, load averages: 0.12, 0.13, 0.14


Figure 9: Monitoring Zones Using prstat. The top half of this split view shows the individual processes consuming
the most CPU cycles. The bottom half shows a view of CPU usage aggregated by zone.


One of the strengths of zones is its integration
with the base operating system. To provide a compre- 
hensive solution, pervasive integration with the wider 
systems management software stack is necessary, and 
will be a major part of our future work. 


Availability 


Solaris Zones, which has been productized under
the name N1 Grid Containers, is an integrated part of
the Solaris 10 Operating System. Pre-release versions
are available as part of the Software Express for Solaris
Program at http://www.sun.com/solaris/10. A clearinghouse
of information about Solaris Zones is available at
http://www.sun.com/bigadmin/content/zones. Documentation
is available at http://docs.sun.com.


Conclusions 


A successful server consolidation must drive 
down both initial and recurring costs and day-to-day 
complexity for all involved. Having less hardware to 
manage is an important goal. However, the ability to 
maintain less software — fewer operating system 
instances — can have an even greater impact on the 
long-term cost reduction realized. The savings in oper- 
ating system licenses and service contracts alone can 
be substantial. The best consolidations also allow a 
site to split the platform administration and application 
administration tasks. This capability allows the IT 
organization to delegate certain work responsibilities 
while maintaining control over the server itself, so 
areas of specialization can be exploited. 


Solutions that create a hierarchy of control on a 
single host without sacrificing observability allow IT 
organizations to act as infrastructure providers who 
can provide compute resources, not just networks and 
SANs. Simultaneously, application expertise can 
remain with the department deploying or developing 
the application. 


Solaris Zones offers the first fully realized facility 
for server consolidation built directly into a commod- 
ity operating system. Zones provides the namespace, 
security and resource isolation needed to drive effec- 
tive consolidation in the real world. 